pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Aaron Gokaslan	3cb16ebf08	[BE]: Update ruff to 0.4.5 (#126979 ) Update ruff to 0.4.5 and addresses some false negatives that have been found in the newer version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979 Approved by: https://github.com/ezyang	2024-05-24 18:38:35 +00:00
PyTorch MergeBot	f749c5def8	Revert "[AOTI] Fix an int array codegen issue (#126801 )" This reverts commit `ff617ab6c8`. Reverted https://github.com/pytorch/pytorch/pull/126801 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
Aaron Orenstein	e4623de4cf	typing scheduler.py [2/2]: Apply types (#126656 ) Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout. We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656 Approved by: https://github.com/eellison	2024-05-22 20:33:31 +00:00
Bin Bao	ff617ab6c8	[AOTI] Fix an int array codegen issue (#126801 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains symbol expression, we can't declare it with constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720	2024-05-22 15:33:24 +00:00
Bin Bao	b40fb2de59	[AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575 Approved by: https://github.com/chenyang78 ghstack dependencies: #126369	2024-05-21 18:20:20 +00:00
Yueming Hao	acfe237a71	Fix C++ compilation error for tensor array in abi_compatible mode (#126412 ) Fixes #122048 There is a compilation error https://github.com/pytorch/pytorch/issues/122048 when the element type in an array is tensor. It is because `val_to_arg_str does` not take arg type as input, and always generate an int array. This PR change the underlying `codegen_int_array_var` to `codegen_var_array` by adding type checks and corresponding code generations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126412 Approved by: https://github.com/desertfire	2024-05-20 20:57:50 +00:00
angelayi	9e1826deff	[torchbind] Add inductor support (#123709 ) Example inductor generated python code: [P1245776497](https://www.internalfb.com/phabricator/paste/view/P1245776497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123709 Approved by: https://github.com/eellison	2024-05-13 18:18:17 +00:00
Kai Londenberg	10f673541e	[Inductor cutlass backend] Enabled nonzero workspace and Cutlass StreamK (#125406 ) Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406 Approved by: https://github.com/jansel	2024-05-05 15:28:45 +00:00
Edward Z. Yang	6f70d22277	Extend torch.utils._sympy.symbol for more Inductor symbols (#125419 ) I'm still missing a few, cdzq at least Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125419 Approved by: https://github.com/lezcano ghstack dependencies: #125395	2024-05-04 09:05:00 +00:00
Sam Larsen	254128c16e	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Differential Revision: [D56723770](https://our.internmc.facebook.com/intern/diff/D56723770) Co-authored-by: Sam Larsen <slarsen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-30 16:54:16 +00:00
Edward Z. Yang	e5e623af4b	Codegen runtime asserts in Inductor (#124874 ) This completely subsumes https://github.com/pytorch/pytorch/pull/120816 This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain. Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes, we would immediately turn all of these assertions into noops, because when we evaluated their expressions, we would see that because we had a deferred runtime assert in the ShapeEnv, we know "oh, of course this expression is True" already. Oops! So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point. There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway. torch/_dynamo/output_graph.py and torch/_inductor/lowering.py - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that ShapeEnv will be good enough to capture all of these. torch/_inductor/codegen/wrapper.py and torch/_inductor/ir.py - Add a way to codegen sizevars without forcing simplification torch/_inductor/graph.py - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes. torch/fx/experimental/symbolic_shapes.py - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards which there's a comment for in graph.py torch/fx/passes/runtime_assert.py - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node acessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as argments). There are some new tests exercising this. TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874 Approved by: https://github.com/jansel ghstack dependencies: #124864	2024-04-29 10:19:29 +00:00
Aaron Gokaslan	49ca2b3429	[BE]: Apply RUF025 perf fixups (#125104 ) Uses `dict.fromkeys()` for more efficient dict construction. Automatically generated by RUF025 (prev). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125104 Approved by: https://github.com/ezyang	2024-04-28 15:09:21 +00:00
Edward Z. Yang	4c44e2b236	Improved unbacked SymInt input support in Inductor (#124739 ) This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/ This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed: * When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols) * Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work * Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt Fixes https://github.com/pytorch/pytorch/issues/124652 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316, #124394	2024-04-25 13:29:53 +00:00
PyTorch MergeBot	f6ce94dca5	Revert "[inductor] Remove usage of device_interface from _inductor.runtime (#124592 )" This reverts commit `5d45eb77f1`. Reverted https://github.com/pytorch/pytorch/pull/124592 on behalf of https://github.com/jeanschmidt due to breaking internal tests, check D56522594 ([comment](https://github.com/pytorch/pytorch/pull/124592#issuecomment-2076957668))	2024-04-25 11:28:23 +00:00
Edward Z. Yang	13ab24f192	Reimplement unbacked symbol bindings in Inductor (#124394 ) This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down. 1. torch/_inductor/graph.py - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures. 2. torch/_inductor/ir.py - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also torch/_inductor/lowering.py, torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/cpp_wrapper_cpu.py for the lowering and codegen changes for item) * process_kernel - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node. * codegen_unbacked_symbol_defs - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming. 3. _rename_unbacked_to in torch/fx/experimental/symbolic_shapes.py - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However... * torch/_functorch/_aot_autograd/collect_metadata_analysis.py - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all. * torch/_dynamo/eval_frame.py - same deal; I just searched for all sites we called clear() on pending 4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor * torch/_dynamo/eval_frame.py - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes) * torch/_export/pass_base.py - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication. * torch/_subclasses/fake_tensor.py, torch/_subclasses/fake_impls.py (with call site updates at torch/_functorch/_aot_autograd/traced_function_transforms.py and torch/fx/passes/fake_tensor_prop.py) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos. * torch/_inductor/scheduler.py - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`. * torch/fx/experimental/symbolic_shapes.py - A few things * rebind_unbacked (re _tensor_version). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case. * rebind_unbacked (re Simplify SymBool binding). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass. * compute_unbacked_bindings (re This is pretty fragile). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316	2024-04-25 02:08:59 +00:00
Jason Ansel	5d45eb77f1	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-23 17:51:25 +00:00
Bin Bao	bb37910e30	[AOTI] Fixes ScatterFallback codegen (#124580 ) Summary: For https://github.com/pytorch/pytorch/issues/123184. ScatterFallback currently relies on op name matching for codegen, which makes its cpp codegen fragile. Refactor to use op_overload and fix the relevant unit test failures. Differential Revision: [D56417815](https://our.internmc.facebook.com/intern/diff/D56417815) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124580 Approved by: https://github.com/chenyang78	2024-04-22 20:47:26 +00:00
Jason Ansel	0093735ccd	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 18:46:40 +00:00
Jason Ansel	480585fd2b	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 18:41:12 +00:00
PyTorch MergeBot	16eea7c6a5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 )" This reverts commit `a7035cc11a`. Reverted https://github.com/pytorch/pytorch/pull/124552 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
PyTorch MergeBot	30dec1da84	Revert "[inductor] Use compile time config values in runtime (#124561 )" This reverts commit `3af12447f8`. Reverted https://github.com/pytorch/pytorch/pull/124561 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124561#issuecomment-2070537634))	2024-04-22 18:24:38 +00:00
Jason Ansel	3af12447f8	[inductor] Use compile time config values in runtime (#124561 ) This removes usage of torch._inductor.config from `torch._inductor.runtime`. Fixing two issues: 1) If configs change we should really use the compile time ones 2) In compile workers, we want to use the parent process config Pull Request resolved: https://github.com/pytorch/pytorch/pull/124561 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553, #124557, #124559, #124560, #124569	2024-04-22 04:51:30 +00:00
Jason Ansel	a7035cc11a	[inductor] Refactor runtime files into torch._inductor.runtime (part 1) (#124552 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124552 Approved by: https://github.com/yanboliang	2024-04-22 04:51:05 +00:00
Edward Z. Yang	afa78ad08c	Call writeline from writelines (#124515 ) This makes it more convenient to add a breakpoint here. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515 Approved by: https://github.com/albanD	2024-04-20 15:45:30 +00:00
Zhuoran Zhao	b0d83726bd	[5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241 ) Summary: as title Test Plan: CI & unit test patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test Differential Revision: D56223917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-04-19 18:57:38 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
Adnan Akhundov	03a05e791a	Don't add non-integer Triton kernel arg 1 to equal_to_1 (#123886 ) Summary: Triton compiler adds constnat argument 1 to `equal_to_1` [only when it's an int](`8c5e33c77e/python/triton/runtime/jit.py (L275)`). Here we restrict Inductor's `equal_to_1` in the same way. Test Plan: ``` $ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg ... ---------------------------------------------------------------------- Ran 1 test in 6.528s OK $ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg ... ---------------------------------------------------------------------- Ran 2 tests in 10.142s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886 Approved by: https://github.com/oulgen ghstack dependencies: #123703	2024-04-14 20:34:05 +00:00
Adnan Akhundov	3d9dc976ae	Handle unqualified imports in custom Triton kernels (#123703 ) Summary: If in a custom (user-written) Triton kernel an externally imported symbol is used directly, we need to codegen the corresponding import outside the kernel body in the Python wrapper. E.g., if the user code has this: ``` from triton.language.extra.cuda.libdevice import fast_dividef @triton.jit def my_kernel(...): ... x = fast_dividef(...) ... ``` The `from triton.language.extra.cuda.libdevice import fast_dividef` line needs to be carried over together with the `my_kernel` function. The PR adds this. Test Plan: ``` $ python test/inductor/test_triton_kernels.py ... ---------------------------------------------------------------------- Ran 464 tests in 113.512s OK ``` Differential Revision: [D55953241](https://our.internmc.facebook.com/intern/diff/D55953241) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123703 Approved by: https://github.com/jansel, https://github.com/oulgen	2024-04-11 23:49:25 +00:00
Brian Hirsh	134e56fa33	inductor: log unique id to match output_code to aot graphs (#118647 ) I found it helpful to be able to see, given some inductor output code, which AOT graph it came from. When you have large models with multiple graphs floating around this can be difficult, so I added the aot_config.aot_id to the printed inductor output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118647 Approved by: https://github.com/ezyang	2024-04-11 14:37:07 +00:00
Adnan Akhundov	c773913407	Add torch.while_loop support to AOT Inductor (#123586 ) Summary: Previously, `torch.while_loop` was supported only in JIT inductor (added in https://github.com/pytorch/pytorch/pull/122069). Here we extend the support to AOT Inductor. Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_while_loop ... ---------------------------------------------------------------------- Ran 24 tests in 129.236s OK (skipped=8) $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 50 tests in 136.199s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123586 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-04-09 22:53:10 +00:00
Bin Bao	064a650b63	[AOTI][refactor] Improve generate_extern_kernel_out's signature (#123351 ) Summary: Annotate types and make the names more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123351 Approved by: https://github.com/chenyang78 ghstack dependencies: #123346	2024-04-04 23:23:50 +00:00
ydwu4	a4035bea5c	[while_loop] support closures (#123018 ) We add an additional_inputs arguments to the HOP while_loop and rename the operands to carried_inputs based on offline discussion with @zou3519 . This allows us to support closures, parameters and buffers. The alternative is to pass the lifted inputs directly to outputs of body_fn. But since we want the body_fn's output to not aliasing input. We'll need to copy the inputs and remove the copies later. This is a bit more work to do. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123018 Approved by: https://github.com/aakhundov ghstack dependencies: #123217	2024-04-03 19:35:15 +00:00
Mu-Chu Lee	4b725e1619	[AOTInductor] Support quantized linear on CPU with fbgemm (#123069 ) Summary: Added support for quantized linear on CPU with fbgemm. Specifically, for torch.ops.quantized.linear_unpacked_dynamic_fp16, we decompose it into two steps, pack weight, and fbgemm's qlinear with packed weight. Test Plan: Included in commit. test_aot_inductor::test_quantized_linear Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D55577959](https://our.internmc.facebook.com/intern/diff/D55577959) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123069 Approved by: https://github.com/hl475	2024-04-01 09:15:05 +00:00
Edward Z. Yang	10bdf64427	Properly pexpr the actual sympy.Expression, don't repr it. (#122893 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122893 Approved by: https://github.com/albanD, https://github.com/desertfire, https://github.com/jansel	2024-03-29 06:40:19 +00:00
Mu-Chu Lee	091a24495b	[AOTInductor] Support use_runtime_constant_folding for CPU. (#122563 ) Summary: We allow CPU to use the config use_runtime_constant_folding. Changes include 1. Rearrange USE_CUDA flags. Add CPU sections that consumes memory directly. 2. Codegen changes to accomodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration. Test Plan: Activate tests that were deactivated for CPU before. Reviewed By: khabinov Differential Revision: D55234300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563 Approved by: https://github.com/chenyang78	2024-03-28 17:49:05 +00:00
Bin Bao	537cd66e73	[Inductor] Support custom op in JIT with cpp wrapper (#122554 ) Summary: To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code. Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-26 18:48:45 +00:00
Adnan Akhundov	9223b2cb31	Pop codegened parent graph from wrapper in GraphLowering (#122469 ) Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done. Fixes https://github.com/pytorch/pytorch/issues/121887. Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469 Approved by: https://github.com/eellison	2024-03-25 20:27:59 +00:00
chilli	4d8a3f8bb3	changed aliasing checks to properly recurse for computing last usage (#122444 ) Fixes https://github.com/pytorch/pytorch/issues/122457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444 Approved by: https://github.com/yifuwang, https://github.com/jansel ghstack dependencies: #121624, #122474	2024-03-23 01:43:21 +00:00
chilli	d34514f8db	Renamed mutationlayout/aliasedlayout (#122474 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474 Approved by: https://github.com/jansel ghstack dependencies: #121624	2024-03-22 08:32:14 +00:00
Adnan Akhundov	e419011471	[inductor] Add torch.while_loop support to JIT Inductor (#122069 ) Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file). AOT Inductor support will be added in a follow-up PR. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 38 tests in 159.387s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069 Approved by: https://github.com/jansel, https://github.com/eellison	2024-03-22 02:45:27 +00:00
Kai Londenberg	2cd0a5d516	[Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808 ) There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none, where the return value of a call to V.graph._shape_env._maybe_evaluate_static is being discarded. This fix changes that to work how it was likely intended to. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808 Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov	2024-03-21 21:15:32 +00:00
Adnan Akhundov	456b112dca	[inductor] Support non-Tensor predicate in torch.cond (#122378 ) Summary: Previously, we only supported torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicate, to match the `torch.cond` [sematics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export. Test Plan: ``` $ python test/inductor/test_control_flow.py ... ---------------------------------------------------------------------- Ran 34 tests in 56.980s OK $ python test/inductor/test_aot_inductor.py -k test_cond ... ---------------------------------------------------------------------- Ran 54 tests in 460.093s OK (skipped=4) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-21 14:35:01 +00:00
Oguz Ulgen	c0b2e56c8f	Support triton.language.dtype with torch.compile -- Second Attempt (#122141 ) This PR is the second attempt at supporting `triton.language.dtype`, now instead of putting it on the graph, we put it on the side table since it is a constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141 Approved by: https://github.com/jansel ghstack dependencies: #122140	2024-03-19 19:40:52 +00:00
Oguz Ulgen	7c5e29ae71	Back out "Support `triton.language.dtype` with `torch.compile` (#121690 )" (#122108 ) Summary: Some hard to deal with package import/export related problems. Lets revert and start with clean slate. Test Plan: CI Differential Revision: D55024877 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108 Approved by: https://github.com/ezyang	2024-03-18 20:50:28 +00:00
Bin Bao	818b14025a	[AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523 ) Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu. Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523 Approved by: https://github.com/chenyang78	2024-03-14 22:05:38 +00:00
Oguz Ulgen	79ee6bbde3	Support `triton.language.dtype` with `torch.compile` (#121690 ) Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work. ``` (Pdb) p triton.language.float32 triton.language.fp32 (Pdb) p str(triton.language.float32) 'fp32' (Pdb) p repr(triton.language.float32) 'triton.language.fp32' ``` This means that we need to "rewrite" them for fx graph and inductor execution. This PR allows Mamba2 to work with `torch.compile`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690 Approved by: https://github.com/Skylion007	2024-03-12 23:21:46 +00:00
Peter Bell	459c5bca58	[inductor] Refactor common triton imports into one function (#121438 ) This means when codegen depends on a particular import we only need to add it in one place and it's applied to all triton kernels. This also changes codegen slightly so instead of generating `@pointwise` we now generate `@triton_heuristics.pointwise` just so the imports are the same for all kernel types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438 Approved by: https://github.com/lezcano	2024-03-09 18:17:36 +00:00
Adnan Akhundov	3d089de851	Add torch.cond support to AOT Inductor (#121120 ) Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends). Notable additions: - A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config). - Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers. - More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions). - New unit tests to cover the added AOT Inductor + `torch.cond` functionality. Codegen examples from the new unit tests: - [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e) - [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb) - [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6) - [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223) - [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711) - [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12) - [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690) Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_cond ... ---------------------------------------------------------------------- Ran 42 tests in 170.619s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-07 22:39:57 +00:00
Oguz Ulgen	18d574a07a	[Inductor] Use indices for constants in triton_meta (#121427 ) @bertmaher pointed out that constants are passed with their indices, not their names. Looking at triton source, this appears to be true `392370b303/python/triton/runtime/jit.py (L381-L385)` I'm guessing both indices and names work here but lets be consistent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121427 Approved by: https://github.com/aakhundov	2024-03-07 21:59:43 +00:00
Bin Bao	7e598c0053	[Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309 ) Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309 Approved by: https://github.com/chenyang78	2024-03-07 14:22:06 +00:00

1 2 3 4 5 ...

354 Commits