pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Bin Bao	aca0807101	[AOTI] Use random inputs to autotune the backward pass (#125291 ) Summary: This is for JIT Inductor with cpp wrapper, fixing https://github.com/pytorch/pytorch/issues/117367. In the backward pass, we don't have real inputs to execute the backward pass to autotune kernels. We have 3 options here, 1) use random tensor inputs; 2) store the forward outputs and feed them to backward (non-trivial because of parameter re-ordering); 3) autotune each kernel with random inputs in a subprocess (similar to select_algorithm). This PR uses the easist option 1. Option 3 is where we are going as the next step, which will simplify the cpp wrapper codegen for the CUDA backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125291 Approved by: https://github.com/chenyang78, https://github.com/angelayi	2024-05-10 20:13:34 +00:00
Sun, Jiayi	82b7b59d2a	[inductor] Check if n is the input tensor of conv_pointwise (#125119 ) Fix https://github.com/pytorch/pytorch/issues/124837. Check whether n is the input tensor of convolution_pointwise or qconv2d_pointwise, if so freeze the layout to channels_last. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125119 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-05-08 13:25:49 +00:00
Shunting Zhang	48b6c8dbc3	[Inductor] log fusion failure due to index mismatch (#124986 ) The scheduler searches for fusion opportunities by looking for common memory access. Two memory access are considered common not only when the buffer name match, but it also requires more things - index formula matches - var_ranges matches In this PR, I want to log all the fusion failures due to mismatch index formula or var_ranges. I also want to further categories the failures. Right now I found the following failure categories - rand_seed: the index for rand seed access is an integer and different access uses different integer offset - different numel: this happens for cat operation - broadcast: e.g. kernel A write a buffer which is broadcasted and read by kernel B - different loop orders: the major category we want inductor to be able to fuse - different offset: happens when use a concatenated linear layer to project Q/K/V and then split the result. Each split will point to the same buffer with different offset. - unknown My hope is to make sure for the models I tested, there is no fusion failure falling in the unknown category so all the failures are well understood and categories. Right now it's true for BertForMaskedLM ( https://gist.github.com/shunting314/6dc2c903629d342fa63ba731a171adc2 ), DistillGPT2 ( https://gist.github.com/shunting314/145176f2e850103c7fad4ad72f0e200e ) and llm.c ( https://gist.github.com/shunting314/cfc64a326312a889ba55f79bd47b2082 ) For BertForMaskedLM, we found 82 instances of fusion failures and majority of them are due to different loop orders! Studying the log a bit more can help us figure out where all these loop order mismatch comes from in real models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124986 Approved by: https://github.com/eellison, https://github.com/jansel	2024-05-07 02:29:00 +00:00
PyTorch MergeBot	7ffa5558ee	Revert "[FX] Update type hints in `torch.fx._compatibility.py` (#125469 )" This reverts commit `235b4d6ec2`. Reverted https://github.com/pytorch/pytorch/pull/125469 on behalf of https://github.com/izaitsevfb due to breaks pyre in dependent projects (internal: see D56986361) ([comment](https://github.com/pytorch/pytorch/pull/125469#issuecomment-2096665396))	2024-05-06 18:36:43 +00:00
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Xuehai Pan	235b4d6ec2	[FX] Update type hints in `torch.fx._compatibility.py` (#125469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125469 Approved by: https://github.com/Skylion007 ghstack dependencies: #125468	2024-05-05 19:30:22 +00:00
Edward Z. Yang	e5e623af4b	Codegen runtime asserts in Inductor (#124874 ) This completely subsumes https://github.com/pytorch/pytorch/pull/120816 This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain. Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops! So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point. There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway. torch/_dynamo/output_graph.py and torch/_inductor/lowering.py - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these. torch/_inductor/codegen/wrapper.py and torch/_inductor/ir.py - Add a way to codegen sizevars without forcing simplification torch/_inductor/graph.py - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes. torch/fx/experimental/symbolic_shapes.py - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py torch/fx/passes/runtime_assert.py - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node acessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as argments). There are some new tests exercising this. TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874 Approved by: https://github.com/jansel ghstack dependencies: #124864	2024-04-29 10:19:29 +00:00
CaoE	555f1aeb02	Fix module buffer mutation (#124586 ) Fixes #124583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124586 Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire	2024-04-29 06:05:12 +00:00
chilli	7321005dd8	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Differential Revision: [D56583900](https://our.internmc.facebook.com/intern/diff/D56583900) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-26 01:02:28 +00:00
David Berard	4259e5d0e0	[inductor] Specialize on unguarded alignment of example inputs (#123319 ) When inductor generates triton code, the triton code can either assume that the inputs given to it are aligned or unaligned. If they are aligned, triton can use more efficient instructions (like vectorized loads or tensor cores). However, if we generate "aligned" code and pass in unaligned inputs, the triton code will error out; to fix this, we clone unaligned inputs that are passed to triton kernels that expect aligned inputs. This can lead to excessive clones if we have inputs that are not expected to be aligned. In this PR, we use the example input to decide whether the generated triton code should assume alignment or not. If the example input is aligned, then we will generate triton code that assumes alignment; if at runtime we receive an unaligned input, we'll make a clone. Meanwhile, if the example input is not aligned, the generated triton code will not assume inputs are aligned and we won't ever need to clone. Note that the alignment of the inputs is not guarded on; we found that adding guards on tensor offsets (a) was slow in cases where we do a lot of comparisons on tensor offsets, and (b) led to a lot of recompilations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123319 Approved by: https://github.com/eellison	2024-04-25 22:28:15 +00:00
Edward Z. Yang	b4597fffce	Try to reuse old symbol name rather than new symbol name when renaming (#124782 ) Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol. Actually doing this is a bit tricky. Here’s the order things happen when retracing data dependent: 1. Run fake tensor prop: allocate new unbacked SymInt 2. Run proxy tensor mode, calculate bindings and associate them with FX node 3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent So the problem is when we calculate bindings in step (2), we don't know what the original names are yet, we only find out later at (3). But by the time (3) runs, we've already stuffed some new bindings in meta["unbacked_bindings"] and we don't know how to update them! To fix this, I introduce resolve_unbacked_bindings which post facto applies any of the renamings we discovered in (3). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782 Approved by: https://github.com/lezcano ghstack dependencies: #124310, #124314, #124316, #124394, #124739	2024-04-25 14:02:42 +00:00
PyTorch MergeBot	0ca1ff3dce	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit `7c253a7776`. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, check D56522566 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2076908582))	2024-04-25 10:56:38 +00:00
Edward Z. Yang	13ab24f192	Reimplement unbacked symbol bindings in Inductor (#124394 ) This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down. 1. torch/_inductor/graph.py - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures. 2. torch/_inductor/ir.py - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also torch/_inductor/lowering.py, torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/cpp_wrapper_cpu.py for the lowering and codegen changes for item) * process_kernel - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node. * codegen_unbacked_symbol_defs - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming. 3. _rename_unbacked_to in torch/fx/experimental/symbolic_shapes.py - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However... * torch/_functorch/_aot_autograd/collect_metadata_analysis.py - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all. * torch/_dynamo/eval_frame.py - same deal; I just searched for all sites we called clear() on pending 4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor * torch/_dynamo/eval_frame.py - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes) * torch/_export/pass_base.py - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication. * torch/_subclasses/fake_tensor.py, torch/_subclasses/fake_impls.py (with call site updates at torch/_functorch/_aot_autograd/traced_function_transforms.py and torch/fx/passes/fake_tensor_prop.py) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos. * torch/_inductor/scheduler.py - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`. * torch/fx/experimental/symbolic_shapes.py - A few things * rebind_unbacked (re _tensor_version). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case. * rebind_unbacked (re Simplify SymBool binding). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass. * compute_unbacked_bindings (re This is pretty fragile). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316	2024-04-25 02:08:59 +00:00
eellison	68225072e8	Match insignificant strides for sdpa inputs (#124859 ) Fix for https://github.com/pytorch/pytorch/issues/124289. There was a tensor which had a single, expanded element. inductor generated the strides as all 0, while sdpa expects a dense last dimension `t.stride(-1) == 1`. While these are equivalent, we still hit an error in the kernel. We could make fixes in sdpa, but matching the insignificant strides in inductor also works and I am less aware of the downstream sdpa kernel details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124859 Approved by: https://github.com/drisspg ghstack dependencies: #124751	2024-04-24 23:44:23 +00:00
chilli	7c253a7776	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 17:54:08 +00:00
PyTorch MergeBot	4f3e1f1c93	Revert "Add support for capturing tensors with score_mod (#124444 )" This reverts commit `e0c5113dec`. Reverted https://github.com/pytorch/pytorch/pull/124444 on behalf of https://github.com/malfet due to This is weird, but somehow profile test started to timeout after this PR, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=noGPU_AVX512 ([comment](https://github.com/pytorch/pytorch/pull/124444#issuecomment-2072506731))	2024-04-23 14:39:37 +00:00
chilli	e0c5113dec	Add support for capturing tensors with score_mod (#124444 ) ``` import torch from torch import nn import torch.nn.functional as F import torch._inductor.config as config # torch.set_default_device('cuda') import torch from torch.nn.attention._templated_attention import _templated_attention as templated_attention from triton.testing import do_bench from torch.nn.attention import SDPBackend, sdpa_kernel index = torch.ops.aten torch.manual_seed(0) B = 16 H = 16 S = 2048 D = 64 head_scale = torch.randn(H, device='cuda') def alibi(score, batch, head, token_q, token_kv): return score + torch.ops.aten.index(head_scale, [head]) * (token_q - token_kv) bias = torch.randn(H, S, S, dtype=torch.float16, device='cuda') query = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) key = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) value = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) compiled = torch.compile(templated_attention) out = compiled(query, key, value, score_mod=alibi) out2 = templated_attention(query, key, value,score_mod=alibi) print((out - out2).abs().mean()) assert (out - out2).abs().mean() < 1e-3 print("Flash (no mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value))) print("Flash (mask): ", do_bench(lambda: F.scaled_dot_product_attention(query, key, value, attn_mask=bias))) print("flexattention: ", do_bench(lambda: compiled(query, key, value, score_mod=alibi))) ``` <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/18c175d0-2720-4dfd-8747-85b8a8f609f5"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124444 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-04-23 06:20:13 +00:00
Aaron Gokaslan	29cc293725	[BE]: FURB142 - Remove set mutations. Use set update (#124551 ) Uses set mutation methods instead of manually reimplementing (update, set_difference etc). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551 Approved by: https://github.com/ezyang	2024-04-21 14:12:33 +00:00
eellison	e4f6340f21	realize inputs to mem bound mm decomposition (#123165 ) Differential Revision: [D55639709](https://our.internmc.facebook.com/intern/diff/D55639709) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123165 Approved by: https://github.com/jackiexu1992	2024-04-18 23:10:04 +00:00
Sun, Jiayi	1f04c29be5	[inductor] Freeze the layout of the conv input to channels_last (#122765 ) Fix https://github.com/pytorch/pytorch/issues/118082. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122765 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-04-18 02:23:38 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Shunting Zhang	fb6f6270d6	[inductor] comprehensive padding (#120758 ) This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120758 Approved by: https://github.com/jansel	2024-04-15 19:05:51 +00:00
Jason Ansel	6022600cc6	[inductor] Handle meta tensor ops in graph (#123786 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123786 Approved by: https://github.com/anijain2305 ghstack dependencies: #123700, #123705	2024-04-12 19:03:13 +00:00
angelayi	493478db4a	[effects] Add inductor support for tokens (#122347 ) Given the following code/dynamo graph: ``` class GraphModule(torch.nn.Module): def forward(self, L_x_ : torch.Tensor): l_x_ = L_x_ _print = torch.ops.aten._print('moo') res = l_x_ + l_x_; l_x_ = None _print_1 = torch.ops.aten._print('moo') return (res,) ``` AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output: ``` class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"): with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo'); arg0_1 = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None return (getitem_2, add) ``` However when we get to inductor, since we want the inductor generated code to not have any token inputs/outputs for better readability, we want to modify the aten graph by removing the tokens from inputs, and creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators. This has to be done after the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph. ``` class <lambda>(torch.nn.Module): def forward(self, arg1_1: "f32[2, 3]"): _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default() with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo'); _make_dep_token_default = None getitem: "f32[0]" = with_effects[0]; with_effects = None add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1); arg1_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo'); getitem = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,)); getitem_2 = None return (add,) ``` When doing inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which just a `FallbackKernel` but with a pointer to previous effectful operator's call. During scheduling, we will create a `StarDep` between the EffectfulKernel and its previous EffectfulKernel so that they don't get reordered. The inductor generated python code looks like: ``` def call(args): arg1_1, = args args.clear() assert_size_stride(arg1_1, (2, 3), (3, 1)) # Source Nodes: [_print], Original ATen: [] buf2 = aten._print.default('moo') # Source Nodes: [_print_1], Original ATen: [] buf3 = aten._print.default('moo') buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32) cpp_fused_add_0(arg1_1, buf4) del arg1_1 return (buf4, ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347 Approved by: https://github.com/bdhirsh	2024-04-09 03:22:32 +00:00
Yang Chen	e4e5449dfc	[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen that we must generate the same triton source code for the same schedule node in both python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit an missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency related to the code for indexing tensors. For example, let's say in the python codegen phase, we produce "ks2\48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2\48" in the precomputed replacements. In the second cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixed the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136 Approved by: https://github.com/desertfire	2024-04-08 16:51:43 +00:00
Adnan Akhundov	63c221b7fa	Clone mutated inputs in first pass of CPP wrapper compilation (#123316 ) Summary: CPP wrapper compilation is currently done in two passes: in the first pass, Python wrapper is generated and run to compile Triton kernels as a side effect, in the second pass C++ wrapper is generated and compiled. When model inputs are mutated, running the Python wrapper in the first pass mutates the inputs, although the first pass (including the Python wrapper run) is strictly a part of the compilation process, hence must not introduce any side effects on the example inputs. In this PR, we clone mutated inputs in the first pass to avoid input mutation. Fixes https://github.com/pytorch/pytorch/issues/117364. Test Plan: ``` $ TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k test_inductor_layout_optimization_input_mutations_cuda ... . ---------------------------------------------------------------------- Ran 1 test in 6.368s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123316 Approved by: https://github.com/jansel, https://github.com/chenyang78, https://github.com/desertfire	2024-04-05 21:47:19 +00:00
xinan.lin	9743e3a19c	[Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895 ) As the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implemented Intel GPU Inductor backend by: - Reuse WrapperCodegen and TritonScheduling for python wrapper and kernel code generation. And implenented device-specific code generation in XPUDeviceOpOverrides - Reuse fx_pass, lowering, codecache, triton kernel auto-tuning, and compilation. For the test case, this PR provided test/inductor/test_xpu_basic.py for basic inductor backend functionality testing. We'll reuse all the existing Inductor test case in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire	2024-04-05 09:05:11 +00:00
Andrew M. James	bde1a93bc4	Add lowering for resize, decomp for resize_as. (#122317 ) This has been split off from #121354 as the inplace version of these methods prove to be rather tricky. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122317 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-04-03 17:47:29 +00:00
Bin Bao	0ff6155eee	[AOTI] Support module buffer mutation (#123164 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164 Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov	2024-04-02 20:25:26 +00:00
PyTorch MergeBot	1f503dffb3	Revert "[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 )" This reverts commit `7eadb157bd`. Reverted https://github.com/pytorch/pytorch/pull/123136 on behalf of https://github.com/albanD due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/123136#issuecomment-2032163699))	2024-04-02 14:17:03 +00:00
Yang Chen	7eadb157bd	[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation (#123136 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen that we must generate the same triton source code for the same schedule node in both python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit an missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency related to the code for indexing tensors. For example, let's say in the python codegen phase, we produce "ks2\48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2\48" in the precomputed replacements. In the second cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixed the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123136 Approved by: https://github.com/desertfire	2024-04-02 09:00:05 +00:00
PyTorch MergeBot	a236fa9f06	Revert "[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882 )" This reverts commit `384de46395`. Reverted https://github.com/pytorch/pytorch/pull/122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/122882#issuecomment-2027544640))	2024-03-29 17:52:39 +00:00
Yang Chen	384de46395	[aoti] clear precomputed symbol replacements before cpp wrapper compilation (#122882 ) After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content. In AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen that we must generate the same triton source code for the same schedule node in both python and cpp codegen phases. Otherwise, we would end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced from the python codegen. Consequently, we would hit an missing-cuda-kernel error. The precomputed symbol replacements saved in V.graph.sizevars can cause such source-code inconsistency related to the code for indexing tensors. For example, let's say in the python codegen phase, we produce "ks2\48" as part of indexing an input for schedule node A while yielding a replacement pair "ks0 -> ks2\48" in the precomputed replacements. In the second cpp codegen phase, we would produce "ks0" for the same indexing code of schedule node A due to the "ks0 -> ks2*48" replacement pair. This PR fixed the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122882 Approved by: https://github.com/desertfire	2024-03-28 19:06:29 +00:00
Wang, Eikan	f8eeae7aaa	Enable CPP wrapper codegen registration (#121296 ) Extend code gen registration for `CppWrapper`. W/ this PR, an new backend can register its specific `CppWrapper` at runtime. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121296 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-03-26 06:51:03 +00:00
Adnan Akhundov	9223b2cb31	Pop codegened parent graph from wrapper in GraphLowering (#122469 ) Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done. Fixes https://github.com/pytorch/pytorch/issues/121887. Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469 Approved by: https://github.com/eellison	2024-03-25 20:27:59 +00:00
Honglin Zhu	adeedc060f	[Inductor] Fix unbacked symbol in stride when using item() (#122298 ) Fixes #122296 Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298 Approved by: https://github.com/ezyang	2024-03-24 06:27:15 +00:00
chilli	d34514f8db	Renamed mutationlayout/aliasedlayout (#122474 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474 Approved by: https://github.com/jansel ghstack dependencies: #121624	2024-03-22 08:32:14 +00:00
Adnan Akhundov	e419011471	[inductor] Add torch.while_loop support to JIT Inductor (#122069 ) Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file). AOT Inductor support will be added in a follow-up PR. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 38 tests in 159.387s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069 Approved by: https://github.com/jansel, https://github.com/eellison	2024-03-22 02:45:27 +00:00
eellison	18c164ef7c	[Inductor] Match insignficiant strides on outputs (#122239 ) Fix for https://github.com/pytorch/pytorch/issues/116433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239 Approved by: https://github.com/Chillee	2024-03-21 05:35:59 +00:00
Mu-Chu Lee	7676433012	[AOTInductor] Reuse generated kernels between constant graph and main graph (#121564 ) Summary: We copy the src_to_kernel from constant graph to main graph so that we could avoid generating duplicating kernels. And pass throught the name counter such that no duplicated names will be generated. Test Plan: Included in commit Differential Revision: D54706767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564 Approved by: https://github.com/desertfire, https://github.com/chenyang78	2024-03-11 22:44:38 +00:00
Adnan Akhundov	3d089de851	Add torch.cond support to AOT Inductor (#121120 ) Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends). Notable additions: - A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config). - Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers. - More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions). - New unit tests to cover the added AOT Inductor + `torch.cond` functionality. Codegen examples from the new unit tests: - [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e) - [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb) - [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6) - [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223) - [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711) - [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12) - [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690) Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_cond ... ---------------------------------------------------------------------- Ran 42 tests in 170.619s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-07 22:39:57 +00:00
Xia, Weiwen	83d848e1c7	[Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605 ) description Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear. The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case. This feature is targeting PyTorch 2.3 release. Test plan ``` python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear ``` Performance before and after lowering `choose_qparam` to Inductor Before - latency for shape (32, 32) = 0.151 ms latency for shape (128, 128) = 0.153 ms latency for shape (1024, 1024) = 0.247 ms After - latency for shape (32, 32) = 0.049 ms - latency for shape (128, 128) = 0.052 ms - latency for shape (1024, 1024) = 0.133 ms Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2024-03-02 05:11:17 +00:00
Jason Ansel	01ec8df6d8	[Compiled Autograd] Introduce BackwardState capture (#120382 ) This adds support for backwards hooks that are both: 1) Interior to the graph; and 2) Dynamically generated (e.g. lambdas) We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo after the forwards runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382 Approved by: https://github.com/xmfan	2024-02-28 20:36:47 +00:00
Sherlock Huang	3e8b56d362	[Inductor] Track constant's original_fqn mapping (#120524 ) When compiling an deserialized ExportedProgram, constant’s original_fqn is not populated(). Highlighted line is missing, And a latter assertion is breaking due to original_fqn missing. ``` constants_info_[0].name = "L__self___w_pre"; constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32); constants_info_[0].offset = 0; constants_info_[0].data_size = 64; constants_info_[0].from_folded = false; constants_info_[0].shape = {4, 4}; constants_info_[0].stride = {4, 1}; // constants_info_[0].original_fqn = "w_pre"; // this line is missing ``` Inductor is relying `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]`, and is set during dynamo tracing. However, when compiling an deserialized ExportedProgram, we don't do dynamo tracing, thus this field is missing. As a fix, I maintain AOTI's own mapping for constant tensor's fqn. Differential Revision: D54097073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524 Approved by: https://github.com/chenyang78	2024-02-28 17:36:29 +00:00
Edward Z. Yang	1a1fc1047d	Add structured trace logs (#120289 ) Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit How to read the diff: * Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes) * torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs * torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines. * torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log. * test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable. https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289 Approved by: https://github.com/Skylion007 ghstack dependencies: #120712	2024-02-28 01:01:41 +00:00
PyTorch MergeBot	f3dd2a544c	Revert "Add structured trace logs (#120289 )" This reverts commit `9dfaef962c`. Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))	2024-02-27 19:49:05 +00:00
Edward Z. Yang	9dfaef962c	Add structured trace logs (#120289 ) Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit How to read the diff: * Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes) * torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs * torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implement as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log. Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines. * torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log. * test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not be very stable. https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs. Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289 Approved by: https://github.com/Skylion007	2024-02-27 00:04:23 +00:00
Yang Chen	b96ea097ee	[aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391 ) make WrapperCodeGen subclass names consistent with the file names: CppWrapperCodeGen -> CppWrapperCpu CudaWrapperCodeGen -> CppWrapperCuda Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391 Approved by: https://github.com/aakhundov	2024-02-23 10:41:50 +00:00
Adnan Akhundov	badf84bd6b	[inductor] Add torch.cond support to JIT Inductor (#119759 ) Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have the ways of dealing with child fx subgraphs and properly lowering and codegen-ing them. In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs. Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 24 tests in 86.790s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759 Approved by: https://github.com/jansel, https://github.com/eellison	2024-02-17 07:25:27 +00:00
Yang Chen	bc7f3efb09	[aot_inductor] move CppWrapperCodeGen into a separate file (#119871 ) This reverts commit `d8e319a961`. Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871 Approved by: https://github.com/albanD, https://github.com/khabinov ghstack dependencies: #119870	2024-02-16 08:14:20 +00:00

1 2 3 4 5 ...

290 Commits