Summary:
We use a cache to guarantee that each `sym` is codegen'd only once; see the following code:
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we didn't handle the case where the same `sym` needs to be codegen'd in both branches of a condition (the true branch and the false branch), which caused `undefined symbols` errors: P1441378833
To fix the issue, we use a stack to capture the codegen state before generating each branch of the condition and restore that state after the branch codegen is done.
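Below is a minimal, self-contained sketch of the idea (not the exact Inductor code; `WrapperState` and `codegen_conditional` are hypothetical names): snapshot the precomputed-size cache before emitting each branch and restore it afterwards, so both branches emit their own definition of the same symbol instead of one branch skipping it.
```
# Hypothetical sketch of the stack-based save/restore around conditional codegen.
class WrapperState:
    def __init__(self):
        self.computed_sizes = set()  # the cache used by ensure_size_computed
        self.lines = []

    def push_codegen_state(self):
        return set(self.computed_sizes)  # snapshot before entering a branch

    def pop_codegen_state(self, snapshot):
        self.computed_sizes = snapshot   # restore after leaving the branch

def codegen_conditional(state, codegen_true_branch, codegen_false_branch):
    snapshot = state.push_codegen_state()
    codegen_true_branch(state)       # may define ps0 inside the true branch
    state.pop_codegen_state(snapshot)
    codegen_false_branch(state)      # ps0 is emitted again here, not skipped
    state.pop_codegen_state(snapshot)
```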
Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile
PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size
Differential Revision: D58973730
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.
This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.
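A rough sketch of the split (hypothetical names, not the scheduler's actual classes): an operation owns the computation and can produce several buffers, each of which only describes physical storage.
```
from dataclasses import dataclass, field

@dataclass
class Buffer:
    name: str
    size: tuple  # physical storage only; no notion of how it was computed

@dataclass
class Operation:
    name: str
    outputs: list = field(default_factory=list)  # multiple Buffers per op

op = Operation("native_batch_norm", outputs=[
    Buffer("buf0", (8, 16)),   # output
    Buffer("buf1", (16,)),     # save_mean
    Buffer("buf2", (16,)),     # save_invstd
])
print([b.name for b in op.outputs])  # ['buf0', 'buf1', 'buf2']
```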
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.
There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory usage: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effects from input mutation. This can raise the high-water mark for memory consumption.
* Missing Triton kernel autotuning: because autotuning depends on the kernel actually being executed in the two-pass approach, some kernels will never be autotuned when a model contains control flow such as torch.if or torch.while.
This PR is the first step towards solving these problems by moving Triton kernel autotuning to compile time and using random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass Inductor generates separate code just for kernel autotuning (see https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 for an example), and we no longer need to execute the model after the first pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Making the codegen truly one-pass will come later, once this solution has proven stable and generates kernels as performant as before.
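A hedged sketch of the idea (not Inductor's generated autotune code; `kernel_fn` and `arg_metas` are hypothetical): autotune a kernel at compile time by calling it once on random tensors built from recorded shapes/dtypes, so the real model never has to run and no real inputs need to be cloned.
```
import torch

def autotune_with_random_inputs(kernel_fn, arg_metas, device="cuda"):
    # arg_metas: list of (shape, dtype) recorded at lowering time (assumption).
    args = [
        torch.randn(shape, dtype=dtype, device=device)
        if dtype.is_floating_point
        else torch.randint(0, 8, shape, dtype=dtype, device=device)
        for shape, dtype in arg_metas
    ]
    # The first call triggers compilation and autotuning, even for kernels
    # sitting behind control flow that might never execute on real inputs.
    kernel_fn(*args)
```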
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead
per call compared to `PyArgParser`.
Note that views of inputs are reconstructed by AOTAutograd before being returned
to the python code, so dispatching for autograd's sake shouldn't be required
here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move the injection point of the NaN & INF checker earlier, which prevents buffers from being freed before the check.
3. Inject the debugging code at the kernel level, which avoids trying to read buffers that have been fused in place into a single kernel (a rough sketch of the injected check follows).
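A rough illustration (not the exact generated code) of the kind of check injected at the kernel level: validate a kernel's output buffer right after the kernel runs, before the buffer can be freed or reused by a fused kernel.
```
import torch

def assert_no_nan_or_inf(buf: torch.Tensor, name: str) -> None:
    if torch.isnan(buf).any():
        raise RuntimeError(f"NaN found in buffer {name}")
    if torch.isinf(buf).any():
        raise RuntimeError(f"INF found in buffer {name}")
```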
Test Plan:
Debugging utility.
Tested by running the existing tests with the env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```
Reviewed By: ColinPeppler
Differential Revision: D57989176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away. Do this at the very end.
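A minimal sketch of the idea using plain sympy: once lowering is done, apply any accumulated replacements so no stale symbols survive in stored expressions.
```
import sympy

s0, s1 = sympy.symbols("s0 s1", integer=True, positive=True)
replacements = {s1: 2 * s0}        # e.g. a replacement learned during lowering
stride_expr = 4 * s1 + 3           # expression recorded before the replacement

final_expr = stride_expr.xreplace(replacements)
print(final_expr)                  # 8*s0 + 3 -- the stale s1 is gone
```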
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543
Approved by: https://github.com/lezcano
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.
To conservatively maintain the same behavior elsewhere, I've added an import of torch._inductor.async_compile everywhere we import codecache (except in autograd_cache.py, where the explicit goal is to not do this).
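An illustration of the conservative pattern described above (a sketch of the import shape, not a claim about any particular call site): modules that still want AsyncCompile's side effects import the new module explicitly alongside codecache.
```
import torch._inductor.async_compile  # noqa: F401  -- keep the prior side effects
from torch._inductor import codecache
```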
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.
We should probably eventually add a new node type between BaseSchedulerNode and all the non-FusedSchedulerNode types to mark the split between nodes that have a valid `self.node` and ones that don't. That would make a lot of the `assert self.node is not None` churn go away, but it would be a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.
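A small sketch of the resulting pattern (`ExampleNode` is a hypothetical stand-in, not a scheduler class): with `disallow-untyped-defs` every def needs annotations, and Optional fields like `self.node` get narrowed with asserts before use, which is the churn mentioned above.
```
# mypy: disallow-untyped-defs
from typing import Optional

class ExampleNode:
    def __init__(self, node: Optional[str]) -> None:
        self.node = node  # None for fused nodes that wrap several sub-nodes

    def get_name(self) -> str:
        assert self.node is not None  # narrow the Optional for mypy
        return self.node
```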
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops.
This is a simpler rewrite of my original version of #119005, using @peterbell10's workspace allocation mechanism from #117992.
Test Plan:
- Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125406
Approved by: https://github.com/jansel
This completely subsumes https://github.com/pytorch/pytorch/pull/120816
This makes use of the unbacked binding machinery to teach Inductor how to generate deferred runtime asserts directly. There is some back story about why I did it this way, let me explain.
Previously, our strategy for generating runtime asserts was that Dynamo would insert them into the FX graph after finishing tracing, and we would attempt to code generate them based on the FX graph. This is a good strategy for export, where we immediately export the graph. However, this strategy was afflicted by problems in eager, where we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into noops, because when we evaluated their expressions, the ShapeEnv already contained a matching deferred runtime assert and concluded "oh, of course this expression is True". Oops!
So, with this PR, we take the attitude that as long as the ShapeEnv sticks around, the ShapeEnv's list of deferred runtime asserts is the source of truth, and we don't put anything in the graph. So we just need to decide when to actually generate asserts, and the place I picked was Inductor lowering, since we already have an AssertScalar buffer concept, and so I just need to insert them at this point. AssertScalar also uses raw sympy.Expr rather than SymInt/Bool, so it is easier to prevent unrestricted simplification at this point.
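A hedged sketch (not the actual AssertScalar codegen) of what emitting a deferred runtime assert from a raw sympy.Expr can look like in the wrapper: print the unsimplified expression as Python and wrap it in a check.
```
import sympy

def codegen_runtime_assert(expr, msg):
    cond = sympy.pycode(expr)  # print the sympy expression as Python source
    return f"if not ({cond}): raise RuntimeError({msg!r})"

u0 = sympy.Symbol("u0", integer=True)
print(codegen_runtime_assert(sympy.Le(u0, 100), "Deferred runtime assert u0 <= 100 failed"))
# if not (u0 <= 100): raise RuntimeError('Deferred runtime assert u0 <= 100 failed')
```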
There are a few things jumbled together in this PR. I can split them if you want, but some of the changes are before I changed my strategy, but they're useful changes anyway.
**torch/_dynamo/output_graph.py** and **torch/_inductor/lowering.py** - Here, we stop putting deferred runtime asserts in the graph. I also have to make sure we don't DCE unused symbol arguments; we're going to get some goofy graph arguments this way, and it will be good to restore that optimization eventually. We also just disable codegen for `_assert_scalar` entirely; we assume that the ShapeEnv will be good enough to capture all of these.
**torch/_inductor/codegen/wrapper.py** and **torch/_inductor/ir.py** - Add a way to codegen sizevars without forcing simplification
**torch/_inductor/graph.py** - The main logic. Our strategy is to interpose in the same place we are testing that unbacked SymInts are properly showing up in lowered code. The logic is directly analogous to the logic in the existing insert deferred runtime asserts FX pass, but it's simpler because sympy expressions can be directly stored on inductor IR nodes.
**torch/fx/experimental/symbolic_shapes.py** - For extra safety, we have a way of freezing runtime asserts, so that if you try to add more we error. This prevents us from adding runtime asserts after we've done lowering. There's a funny interaction with backwards, for which there's a comment in graph.py.
**torch/fx/passes/runtime_assert.py** - This is not really needed in this PR, but I rewrote the runtime assert logic to use unbacked_bindings rather than inferring it by looking for unbacked SymInts. Now, keypaths are translated into FX node accessors. Unfortunately, I couldn't delete the old inference code, because you still need it to find backed SymInts from arguments (as this pass may be used on graphs which don't explicitly bind all their shape variables as arguments). There are some new tests exercising this.
TODO: I think we need to generate asserts for replacements too. This is a preexisting problem that the old FX pass had too.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124874
Approved by: https://github.com/jansel
ghstack dependencies: #124864
This is a subset of changes extracted from https://github.com/pytorch/pytorch/pull/124683/
This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:
* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints, as we have no hints with which to feed in real values. So I pick a random number, and you are expected to fix it if it doesn't work (see the sketch after this list).
* We need to make sure we don't install dependencies on unbacked SymInt inputs; that puts us down the "promptly deallocate the input" path, which is pointless for unbacked SymInts.
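A hedged sketch of the hint-fallback idea from the second bullet (not Inductor's actual benchmark codegen; `benchmark_hint` and the fallback constant are hypothetical): backed symbols have real hints, but an unbacked symbol like u0 gets an arbitrary placeholder you may need to adjust.
```
import sympy

UNBACKED_FALLBACK = 8192  # arbitrary; fix it up if the benchmark doesn't work

def benchmark_hint(expr, hints):
    # Substitute real hints for backed symbols and the fallback for the rest.
    subs = {s: hints.get(s, UNBACKED_FALLBACK) for s in expr.free_symbols}
    return int(expr.xreplace(subs))

s0, u0 = sympy.symbols("s0 u0", integer=True, positive=True)
print(benchmark_hint(s0 * u0, {s0: 16}))  # 16 * 8192 = 131072
```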
Fixes https://github.com/pytorch/pytorch/issues/124652
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394
This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.
1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self-evident. This leads to a lot of failures.
2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR nodes. Previously, we tried to infer it by looking at the output sizes/strides/etc. and seeing if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item).
* **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
* **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now set up the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
* **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when we finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
* **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites we called clear() on pending
4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load-bearing in Inductor:
* **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
* **torch/_export/pass_base.py** - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
* **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes, while retracing, I would call nonzero() on a fake tensor and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short-circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos (see the epoch sketch after this list).
* **torch/_inductor/scheduler.py** - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
* **torch/fx/experimental/symbolic_shapes.py** - A few things
* **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
* **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
* **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a bit to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.
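A self-contained sketch of the epoch idea from item 4 above (hypothetical class, not the fake tensor implementation): a memoized result, like the unbacked symbol allocated for nonzero(), is only reused if it was recorded in the current fake-tensor-propagation epoch; starting a fresh propagation bumps the epoch and invalidates the memo, forcing a new symbol to be allocated.
```
class NonzeroMemo:
    current_epoch = 0  # bumped whenever a fresh propagation starts

    def __init__(self):
        self._value = None
        self._epoch = None

    def lookup(self):
        # Only honor the memo if it belongs to the current epoch.
        if self._value is not None and self._epoch == NonzeroMemo.current_epoch:
            return self._value
        return None

    def store(self, value):
        self._value = value
        self._epoch = NonzeroMemo.current_epoch

memo = NonzeroMemo()
memo.store("u0")
print(memo.lookup())            # 'u0' -- reused within the same propagation
NonzeroMemo.current_epoch += 1  # retrace from scratch
print(memo.lookup())            # None -- a new unbacked symbol must be allocated
```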
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316