Commit Graph

290 Commits

Author SHA1 Message Date
Bin Bao
aca0807101 [AOTI] Use random inputs to autotune the backward pass (#125291)
Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367.

In the backward pass, we don't have real inputs to execute the backward pass to autotune kernels. We have 3 options here, 1) use random tensor inputs; 2) store the forward outputs and feed them to backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easist option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-05-10 20:13:34 +00:00
Sun, Jiayi
82b7b59d2a [inductor] Check if n is the input tensor of conv_pointwise (#125119)
Fix https://github.com/pytorch/pytorch/issues/124837.
Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise, if so freeze the layout to channels_last.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-05-08 13:25:49 +00:00
Shunting Zhang
48b6c8dbc3 [Inductor] log fusion failure due to index mismatch (#124986)
The scheduler searches for fusion opportunities by looking for common memory access. Two memory access are considered common not only when the buffer name match, but it also requires more things
- index formula matches
- var_ranges matches

In this PR, I want to log all the fusion failures due to mismatch index formula or var_ranges. I also want to further categories the failures. Right now I found the following failure categories
- rand_seed: the index for rand seed access is an integer and different access uses different integer offset
- different numel: this happens for cat operation
- broadcast: e.g. kernel A write a buffer which is broadcasted and read by kernel B
- different loop orders: the major category we want inductor to be able to fuse
- different offset: happens when use a concatenated linear layer to project Q/K/V and then split the result. Each split will point to the same buffer with different offset.
- unknown

My hope is to make sure for the models I tested, there is no fusion failure falling in the unknown category so all the failures are well understood and categories. Right now it's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2  ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 )

For BertForMaskedLM, we found 82 instances of fusion failures and majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatch comes from in real models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-07 02:29:00 +00:00
PyTorch MergeBot
7ffa5558ee Revert "[FX] Update type hints in torch.fx._compatibility.py (#125469)"
This reverts commit 235b4d6ec2.

Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))
2024-05-06 18:36:43 +00:00
Aaron Gokaslan
1dd42e42c4 [BE]: Try TCH autofixes on torch/ (#125536)
Tries TCH autofixes and see what breaks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536
Approved by: https://github.com/ezyang
2024-05-05 23:13:59 +00:00
Xuehai Pan
235b4d6ec2 [FX] Update type hints in torch.fx._compatibility.py (#125469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469
Approved by: https://github.com/Skylion007
ghstack dependencies: #125468
2024-05-05 19:30:22 +00:00
Edward Z. Yang
e5e623af4b Codegen runtime asserts in Inductor (#124874)
This completely subsumes https://github.com/pytorch/pytorch/pull/120816

This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.

Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops!

So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.

There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.

**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar`  entirely; we assume that ShapeEnv will be good enough to capture all of these.

**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification

**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.

**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py

**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node acessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as argments). There are some new tests exercising this.

TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
2024-04-29 10:19:29 +00:00
CaoE
555f1aeb02 Fix module buffer mutation (#124586)
Fixes #124583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-04-29 06:05:12 +00:00
chilli
7321005dd8 Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-26 01:02:28 +00:00
David Berard
4259e5d0e0 [inductor] Specialize on unguarded alignment of example inputs (#123319)
When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned.

In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone.

Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319
Approved by: https://github.com/eellison
2024-04-25 22:28:15 +00:00
Edward Z. Yang
b4597fffce Try to reuse old symbol name rather than new symbol name when renaming (#124782)
Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them.  Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky.  Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know
what the original names are yet, we only find out later at (3).  But by
the time (3) runs, we've already stuffed some new bindings in
meta["unbacked_bindings"] and we don't know how to update them!  To fix
this, I introduce resolve_unbacked_bindings which post facto applies any
of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
2024-04-25 14:02:42 +00:00
PyTorch MergeBot
0ca1ff3dce Revert "Add support for capturing tensors with score_mod (#124444)"
This reverts commit 7c253a7776.

Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))
2024-04-25 10:56:38 +00:00
Edward Z. Yang
13ab24f192 Reimplement unbacked symbol bindings in Inductor (#124394)
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and  **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item)
   * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
    * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
    * **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
    * **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor
    * **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
    * **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too!  Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
    * **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at  **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
    * **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
    * **torch/fx/experimental/symbolic_shapes.py** - A few things
      * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
      * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
      * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316
2024-04-25 02:08:59 +00:00
eellison
68225072e8 Match insignificant strides for sdpa inputs (#124859)
Fix for https://github.com/pytorch/pytorch/issues/124289.

There was a tensor which had a single, expanded element. inductor generated the strides as all 0, while sdpa expects a dense last dimension `t.stride(-1) == 1`. While these are equivalent, we still hit an error in the kernel. We could make fixes in sdpa, but matching the insignificant strides in inductor also works and I am less aware of the downstream sdpa kernel details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124859
Approved by: https://github.com/drisspg
ghstack dependencies: #124751
2024-04-24 23:44:23 +00:00
chilli
7c253a7776 Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 17:54:08 +00:00
PyTorch MergeBot
4f3e1f1c93 Revert "Add support for capturing tensors with score_mod (#124444)"
This reverts commit e0c5113dec.

Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/malfet due to This is weird, but somehow profile test started to timeout after this PR, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=noGPU_AVX512 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2072506731))
2024-04-23 14:39:37 +00:00
chilli
e0c5113dec Add support for capturing tensors with score_mod (#124444)
```
import torch
from torch import nn
import torch.nn.functional as F
import torch._inductor.config as config
# torch.set_default_device('cuda')

import torch
from torch.nn.attention._templated_attention import _templated_attention as templated_attention
from triton.testing import do_bench
from torch.nn.attention import SDPBackend, sdpa_kernel

index = torch.ops.aten
torch.manual_seed(0)

B = 16
H = 16
S = 2048
D = 64

head_scale = torch.randn(H, device='cuda')
def alibi(score, batch, head, token_q, token_kv):
    return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv)
bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda')

query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

compiled = torch.compile(templated_attention)
out = compiled(query, key, value, score_mod=alibi)
out2 = templated_attention(query, key, value,score_mod=alibi)
print((out - out2).abs().mean())
assert (out - out2).abs().mean() < 1e-3
print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value)))
print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias)))
print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi)))
```
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-04-23 06:20:13 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods instead of manually reimplementing (update, set_difference etc).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
eellison
e4f6340f21 realize inputs to mem bound mm decomposition (#123165)
Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165
Approved by: https://github.com/jackiexu1992
2024-04-18 23:10:04 +00:00
Sun, Jiayi
1f04c29be5 [inductor] Freeze the layout of the conv input to channels_last (#122765)
Fix https://github.com/pytorch/pytorch/issues/118082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122765
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-04-18 02:23:38 +00:00
Xuehai Pan
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
Shunting Zhang
fb6f6270d6 [inductor] comprehensive padding (#120758)
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently.

By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758
Approved by: https://github.com/jansel
2024-04-15 19:05:51 +00:00
Jason Ansel
6022600cc6 [inductor] Handle meta tensor ops in graph (#123786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123786
Approved by: https://github.com/anijain2305
ghstack dependencies: #123700, #123705
2024-04-12 19:03:13 +00:00
angelayi
493478db4a [effects] Add inductor support for tokens (#122347)
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_;  l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```

AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo');  arg0_1 = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        return (getitem_2, add)
```
However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
   def forward(self, arg1_1: "f32[2, 3]"):
       _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
       with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo');  _make_dep_token_default = None
       getitem: "f32[0]" = with_effects[0];  with_effects = None
       add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
       with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
       getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
       _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,));  getitem_2 = None
       return (add,)
```
When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
2024-04-09 03:22:32 +00:00
Yang Chen
e4e5449dfc [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-08 16:51:43 +00:00
Adnan Akhundov
63c221b7fa Clone mutated inputs in first pass of CPP wrapper compilation (#123316)
Summary: CPP wrapper compilation is currently done in two passes: in the first pass, Python wrapper is generated and run to compile Triton kernels as a side effect, in the second pass C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, although the first pass (including the Python wrapper run) is strictly a part of the compilation process, hence must not introduce any side effects on the example inputs.

In this PR, we clone mutated inputs in the first pass to avoid input mutation.

Fixes https://github.com/pytorch/pytorch/issues/117364.

Test Plan:

```
$ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda
...
.
----------------------------------------------------------------------
Ran 1 test in 6.368s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316
Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire
2024-04-05 21:47:19 +00:00
xinan.lin
9743e3a19c [Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895)
As the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implemented Intel GPU Inductor backend by:
- Reuse WrapperCodegen and TritonScheduling for python wrapper and kernel code generation. And implenented device-specific code generation in XPUDeviceOpOverrides
- Reuse fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation.

For the test case, this PR provided test/inductor/test_xpu_basic.py for basic inductor backend functionality testing.
We'll reuse all the existing Inductor test case in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-04-05 09:05:11 +00:00
Andrew M. James
bde1a93bc4 Add lowering for resize, decomp for resize_as. (#122317)
This has been split off from #121354 as the inplace version of these
methods prove to be rather tricky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-04-03 17:47:29 +00:00
Bin Bao
0ff6155eee [AOTI] Support module buffer mutation (#123164)
Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164
Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov
2024-04-02 20:25:26 +00:00
PyTorch MergeBot
1f503dffb3 Revert "[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)"
This reverts commit 7eadb157bd.

Reverted https://github.com/pytorch/pytorch/pull/123136 on behalf of https://github.com/albanD due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/123136#issuecomment-2032163699))
2024-04-02 14:17:03 +00:00
Yang Chen
7eadb157bd [aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136
Approved by: https://github.com/desertfire
2024-04-02 09:00:05 +00:00
PyTorch MergeBot
a236fa9f06 Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)"
This reverts commit 384de46395.

Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))
2024-03-29 17:52:39 +00:00
Yang Chen
384de46395 [aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882)
After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen that we must generate the same triton source code
for the same schedule node in both python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit an missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such source-code inconsistency related to the code for indexing
tensors. For example, let's say in the python codegen phase,
we produce "ks2\*48" as part of indexing an input for schedule
node A while yielding a replacement pair "ks0 -> ks2\*48" in
the precomputed replacements. In the second cpp codegen phase,
we would produce "ks0" for the same indexing code of schedule
node A due to the "ks0 -> ks2*48" replacement pair.

This PR fixed the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882
Approved by: https://github.com/desertfire
2024-03-28 19:06:29 +00:00
Wang, Eikan
f8eeae7aaa Enable CPP wrapper codegen registration (#121296)
Extend code gen registration for `CppWrapper`. W/ this PR, an new backend can register its specific `CppWrapper` at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-26 06:51:03 +00:00
Adnan Akhundov
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
Honglin Zhu
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
chilli
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
Adnan Akhundov
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
eellison
18c164ef7c [Inductor] Match insignficiant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
Mu-Chu Lee
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel from constant graph to main graph so that we could avoid generating duplicating kernels. And pass throught the name counter such that no duplicated names will be generated.

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
Adnan Akhundov
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
Xia, Weiwen
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
  latency for shape (128, 128) = 0.153 ms
  latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
Jason Ansel
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
Sherlock Huang
3e8b56d362 [Inductor] Track constant's original_fqn mapping (#120524)
When compiling an deserialized ExportedProgram, constant’s original_fqn is not populated(). Highlighted line is missing, And a latter assertion is breaking due to original_fqn missing.

```
        constants_info_[0].name = "L__self___w_pre";
	constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
	constants_info_[0].offset = 0;
	constants_info_[0].data_size = 64;
	constants_info_[0].from_folded = false;
	constants_info_[0].shape = {4, 4};
	constants_info_[0].stride = {4, 1};
	// constants_info_[0].original_fqn = "w_pre";   // this line is missing
```

Inductor is relying `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]`, and is set during dynamo tracing. However, when compiling
an deserialized ExportedProgram, we don't do dynamo tracing, thus this field is missing.

As a fix, I maintain AOTI's own mapping for constant tensor's fqn.

Differential Revision: D54097073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
2024-02-28 17:36:29 +00:00
Edward Z. Yang
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
PyTorch MergeBot
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962c.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
Edward Z. Yang
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
Yang Chen
b96ea097ee [aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391)
make WrapperCodeGen subclass names consistent with the
file names:

CppWrapperCodeGen -> CppWrapperCpu
CudaWrapperCodeGen -> CppWrapperCuda

Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391
Approved by: https://github.com/aakhundov
2024-02-23 10:41:50 +00:00
Adnan Akhundov
badf84bd6b [inductor] Add torch.cond support to JIT Inductor (#119759)
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have the ways of dealing with child fx subgraphs and properly lowering and codegen-ing them.

In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.

Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-17 07:25:27 +00:00
Yang Chen
bc7f3efb09 [aot_inductor] move CppWrapperCodeGen into a separate file (#119871)
This reverts commit d8e319a961.

Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871
Approved by: https://github.com/albanD, https://github.com/khabinov
ghstack dependencies: #119870
2024-02-16 08:14:20 +00:00