pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit `4f13f69a45`. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533 ) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
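The suppression script quoted in the commit message above is flattened onto one line by the table view. Laid out as runnable Python it might look like the sketch below. The split into `collect_errors`, `suppress`, and `apply_suppressions` is a restructuring for readability and is not part of the original message. The `.*` wildcards in the pattern, and the `file:line:col: error: ... [code]` shape of the mypy output it expects, are assumptions.

```python
import re
from collections import defaultdict

# Assumed pattern: "<path>:<line>:<col>: error: <message> [<error-code>]".
ERROR_RE = re.compile(r"(.*):(\d+):\d+: error:.*\[(.*)\]")


def collect_errors(error_lines):
    """Map each file path to {line number: mypy error code}."""
    found = defaultdict(dict)
    for error in error_lines:
        match = ERROR_RE.match(error)
        if match:
            file_path, line_number, error_type = match.groups()
            found[file_path][int(line_number)] = error_type
    return found


def suppress(code_lines, errors_by_line):
    """Return a copy of one file's lines with '# type: ignore[...]' appended
    to every line that mypy flagged."""
    code = list(code_lines)
    for line_number, error_type in errors_by_line.items():
        # mypy line numbers are 1-based; list indices are 0-based.
        code[line_number - 1] = (
            code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
        )
    return code


def apply_suppressions(error_file):
    """Read mypy output from error_file and rewrite the flagged files in place."""
    with open(error_file, "r") as f:
        errors = collect_errors(f.readlines())
    for file_path, lines in errors.items():
        with open(file_path, "r") as f:
            code = f.readlines()
        with open(file_path, "w") as f:
            f.writelines(suppress(code, lines))
```

Calling `apply_suppressions("error_file.txt")` would then mirror what the quoted script does. Because each suppression is appended to an existing line rather than inserted as a new one, line numbers stay stable while the file is edited.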
Edward Z. Yang	46712b019d	Enable local_partial_types (#118467 ) When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467 Approved by: https://github.com/Skylion007 ghstack dependencies: #118414, #118418, #118432	2024-01-28 13:38:22 +00:00
Isuru Fernando	978faf1fa2	Use an op counter to decide when to realize a kernel (#117030 ) Instead of checking the number of bytes in the string representation of the kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/117030 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-01-27 05:28:46 +00:00
eellison	b95c45fbf7	add stack trace to device skip (#118112 ) Log stack trace of offending cpu use if it causes a disabling of cudagraphs. Also refactoring disable_cudagraphs: bool, and disable_cudagraphs_reason: str -> Optional[str]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112 Approved by: https://github.com/bdhirsh	2024-01-26 22:33:48 +00:00
Nikita Shulga	bd99115276	[AOTI] Enable for MacOS (#118076 ) - Add `darwin` to the list of supported platform - Add `#include <sstream>` to `aoti_runtime/model.h` - Refactor Linux specific constant compilation logic to `_compile_consts_linux` - Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library - Patch file using magic to avoid converting bytes to large hexadecimal string - Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition) - Enable test_aot_inductor.py tests on MacOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076 Approved by: https://github.com/desertfire ghstack dependencies: #118077	2024-01-24 14:24:05 +00:00
Jeff Daily	01abb5af21	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-01-22 18:33:41 +00:00
James Wu	afabed6ae6	[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298 ) fixes #116715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298 Approved by: https://github.com/eellison	2024-01-21 18:47:01 +00:00
PyTorch MergeBot	10923f8720	Revert "[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298 )" This reverts commit `1967394690`. Reverted https://github.com/pytorch/pytorch/pull/117298 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing in MacOS `1967394690`, may be due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/117298#issuecomment-1901594120))	2024-01-20 02:14:58 +00:00
James Wu	1967394690	[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298 ) fixes #116715 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298 Approved by: https://github.com/eellison	2024-01-20 01:37:28 +00:00
PyTorch MergeBot	b637fdc8b3	Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 )" This reverts commit `74e1362499`. Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))	2024-01-19 17:35:04 +00:00
Jeff Daily	74e1362499	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10	2024-01-19 00:50:18 +00:00
Bin Bao	fad7734fa7	[AOTI] Remove caching for compiled model.so (#117087 ) Summary: Oleg found the model.so caching does not compute hash key with model weights included, which can cause incorrect model.so reuse. Since caching is not really necessary in the AOT mode, let's just remove it. Test Plan: CI Differential Revision: D52647555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117087 Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov	2024-01-10 19:53:27 +00:00
Jack Taylor	5046b4981d	[ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329 ) Disabling layout optimisation in inductor for ROCm (https://github.com/pytorch/pytorch/pull/111474) was a bit shortsighted. If there are workloads that heavily use NHWC we will see a perf drop from additional transpose ops. Instead of disabling this entirely on ROCm this is now an opt-in feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116329 Approved by: https://github.com/jansel, https://github.com/eellison	2024-01-10 13:56:27 +00:00
Oleg Khabinov	5377b994da	[aot_inductor] Retrieve original FQNs for weights (#116157 ) Differential Revision: D52303882 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157 Approved by: https://github.com/frank-wei	2024-01-05 21:30:36 +00:00
Bin Bao	e5bcfe205e	[inductor] fix cpp_wrapper inputs mismatch (#116197 ) Summary: fixes https://github.com/pytorch/pytorch/issues/115035, where in the cpp_wrapper JIT inductor, the input args should contain the lifted parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116197 Approved by: https://github.com/jansel	2023-12-26 21:41:47 +00:00
Bin Bao	f4230ec9fd	[inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205 Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel	2023-12-26 16:01:20 +00:00
etaf	7a6cb9fdfb	[Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020 ) As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative inductor backend. ### Design Typically, in order to integrate the Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation scattered in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code. For example: `2a44034895/torch/_inductor/codegen/wrapper.py (L487)` `2a44034895/torch/_inductor/codegen/triton.py (L1996)` So we abstract the device-bias code scattered in WrapperCodegen and TritonScheduling and provide a unified interface "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting and implementing this interface for device flexibility. Currently the `DeviceOpOverrides` only cover Python wrapper code generation. We can further extend it to cover Cpp wrapper code generation on demand. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel	2023-12-22 08:42:51 +00:00
CK Luk	3b70bd3970	Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor (#115979 ) Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels). Test Plan: Unit test: buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test PyPer Mast job: https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e See the *.py files generated in: pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623 Differential Revision: D52221500 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979 Approved by: https://github.com/yanboliang	2023-12-18 18:16:44 +00:00
Bin Bao	0fc04e274d	[inductor] Fix an aliased output bug (#115373 ) Summary: for https://github.com/pytorch/pytorch/issues/97083, when Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373 Approved by: https://github.com/jansel	2023-12-12 01:18:59 +00:00
PyTorch MergeBot	5fe2b138e3	Revert "[inductor] Fix an aliased output bug (#115373 )" This reverts commit `1310f0bf38`. Reverted https://github.com/pytorch/pytorch/pull/115373 on behalf of https://github.com/atalman due to Sorry for reverting your change it broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/115373#issuecomment-1850792869))	2023-12-11 20:02:15 +00:00
Bin Bao	1310f0bf38	[inductor] Fix an aliased output bug (#115373 ) Summary: for https://github.com/pytorch/pytorch/issues/97083, when Pull Request resolved: https://github.com/pytorch/pytorch/pull/115373 Approved by: https://github.com/jansel	2023-12-10 23:52:39 +00:00
Jason Ansel	c370450f02	[inductor] Remove hashing of tensor data for constants (#115356 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115356 Approved by: https://github.com/eellison	2023-12-08 18:05:34 +00:00
Bin Bao	e06bff8bbe	[AOTI] Handle empty input args (#114682 ) Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model into. Currently recording buffer device type happens too late, and this PR fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682 Approved by: https://github.com/chenyang78	2023-12-05 15:02:17 +00:00
Jez Ng	f1fd02503b	Reland #113487 and #112527 (sdpa shim & fp8 AOTInductor support) (#114974 ) This is a backout of #113747 which reverted the above two commits. Now that #113997 has landed, this diff can be landed safely without breaking ABI compatibility. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114974 Approved by: https://github.com/chenyang78	2023-12-02 03:25:51 +00:00
Elias Ellison	7692595834	Use different conv layout optimization heuristics for inference (#114600 ) While many models regress in training when converted to channels last, in inference the results are quite different. Almost all of the models experienced a speedup when converted to channels last. There were a few big regressions in torchbench - `timm_regnet` from `1.4343 → 1.0573` and `timm_resnet` from `1.7484 → 1.2868`. I used a modified script of the operator benchmarks [here](https://gist.github.com/eellison/e11dc645412f52e8b45fb26ba6f9f6a1) to measure the average speedup of convolutions across all of the input shapes found in torchbench according to the existing classifications that @shunting314 used - grouped convs, small channel convs, convolution with larger in-channel than out-channel. Only grouped convolutions benchmarked as a slowdown in inference. I updated the inference heuristic to multiply the flops of each conv with its predicted speedup/slowdown in channels last. With this heuristic the two previously regressing models no longer regress. Speeds up inference for torchbench ~8% and timm ~6%. The motivating model here was SDXL which now hits channels last and improves 10%. There were some models that were sped up in training when forcing channels last (along with a number of regressions). It's possible there is some speedup in training to be had with additional heuristics. We could also have more granular classification/predictions which might benefit both training and inference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114600 Approved by: https://github.com/jansel, https://github.com/shunting314	2023-11-29 07:53:59 +00:00
Jez Ng	87925789ae	Make V.graph properly typed (#114025 ) Previously it lacked a type hint and so was treated as an Any type. This resulted in a lot of untyped code downstream as V.graph is referenced in many places in inductor code. I've typed it properly now as GraphLowering, and fixed the numerous type errors this surfaced. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025 Approved by: https://github.com/eellison ghstack dependencies: #114013	2023-11-21 02:14:29 +00:00
Bin Bao	5a96a42cea	[AOTI] Improve the two-pass wrapper codegen (#114067 ) Summary: For the second-pass, we don't have to rerun the whole inductor flow again. This PR moves that second-pass to the codegen time. This change not only speeds up the compilation, but also removes kernel scheduling inconsistency between the two passes. Another future improvement is to make the second-pass reuse the scheduler and do the wrapper codegen only. This is a copy of https://github.com/pytorch/pytorch/pull/113762 to land in github first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114067 Approved by: https://github.com/chenyang78	2023-11-19 23:30:36 +00:00
eellison	a9134fa99a	Skip cudagraphs when there is sparsity (#113791 ) Fix for dlrm training Pull Request resolved: https://github.com/pytorch/pytorch/pull/113791 Approved by: https://github.com/Chillee	2023-11-17 01:36:03 +00:00
Wei Wei	b19cf868e8	Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527 )" (#113747 ) Test Plan: sandcastle Differential Revision: D51330618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747 Approved by: https://github.com/chenyang78, https://github.com/khabinov	2023-11-15 22:42:22 +00:00
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
Edward Z. Yang	9752ef595c	[BE] Consistently use the sym_stride lowering, instead of short-circuiting before (#113071 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113071 Approved by: https://github.com/voznesenskym	2023-11-10 21:19:12 +00:00
Jez Ng	297c26bb8e	Support fp8 in AOTInductor + support optional<> in C ABI (#112527 ) This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358 It turns out that we need to add support for optional types in order to support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface can't support optional<> directly, I am passing in optional types via pointer instead. `AtenTensorHandle`s are already pointers, so nothing needs to change there. Only value types need to change. We decided on this approach instead of adding an extra `bool` param to the callee because this simplifies things. Having the same number of arguments regardless of whether we are emitting Python / C++ / ABI-compatible C++ makes codegen easier. There are a number of existing ABI-compatible functions that have optional-typed value parameters. Previously, they just assumed they would never be passed a `nullopt` / `None` at runtime. Changing them to use pointer types now would break ABI stability, so I have created an exclude list for those functions. Finally, I think the current implementation is kind of messy, and only works for FallbackKernels, even though technically ExternKernels could also have the same issue. It also doesn't support optional types nested in lists. I've left FIXME comments for both issues. Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 22:56:48 +00:00
Jason Ansel	3914566c73	[dynamo] Refactor OrderedDict to dict (#113234 ) In Python3 all dicts are ordered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113234 Approved by: https://github.com/oulgen, https://github.com/lezcano	2023-11-08 09:27:08 +00:00
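A small illustration of the claim behind the commit above, not taken from the PR itself: since Python 3.7 the built-in `dict` is guaranteed to preserve insertion order, so it can stand in for `OrderedDict` wherever only iteration order matters. One difference a refactor like this has to keep in mind is that `OrderedDict` equality is order-sensitive while `dict` equality is not.

```python
from collections import OrderedDict

# Both containers iterate in insertion order.
plain = {}
plain["b"] = 1
plain["a"] = 2
ordered = OrderedDict([("b", 1), ("a", 2)])
same_iteration_order = list(plain) == list(ordered) == ["b", "a"]

# Equality differs: plain dicts ignore order, OrderedDicts compare it.
plain_ignores_order = plain == {"a": 2, "b": 1}
ordered_checks_order = ordered != OrderedDict([("a", 2), ("b", 1)])

print(same_iteration_order, plain_ignores_order, ordered_checks_order)
```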
Edward Z. Yang	10a829b85d	Retarget sym_size/sym_stride lowerings to their .int overloads (#113054 ) Fixes https://github.com/pytorch/pytorch/issues/112913 The new logging looks like this: ``` [2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg0_1 : [num_users=0] = placeholder[target=arg0_1] [2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg1_1 : [num_users=2] = placeholder[target=arg1_1] [2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] lowering %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, 1), kwargs = {}) [2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] via <function make_pointwise.<locals>.inner at 0x7f0abed28ee0> [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %sym_stride_int : [num_users=1] = call_function[target=torch.ops.aten.sym_stride.int](args = (%add, 0), kwargs = {}) sym_stride [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %sym_stride_int), kwargs = {}) [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] via <function mul at 0x7f0abec8bd00> [2023-11-06 12:48:57,744] [0/0] torch._inductor.graph: [DEBUG] lowering return (mul,) ``` Notice that `sym_stride` no longer is hitting the lowering. This is what the behavior was before I broke it. A better refactor coming soon. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113054 Approved by: https://github.com/davidberard98	2023-11-07 04:15:38 +00:00
Peter Bell	718035791d	Prefer `e.is_number` over `not e.free_symbols` in SymPy (#112688 ) We spend somewhere on the order of 1% in `sympy.Expr.free_symbols` as it is called millions of times. Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is horribly slow. It turns out though that there is another property `is_number` that does what we want. > property is_number: > > Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster > than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined > function. Even further, we also avoid the overhead of building the unnecessary set object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688 Approved by: https://github.com/lezcano	2023-11-06 20:05:13 +00:00
Kai Londenberg	bdfde62e54	[Inductor CUTLASS backend] Epilogue fusion codegen (Step 1) (#110890 ) Summary: This PR adds epilogue fusion code generation support for the new experimental [Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015). Details: A fusion happens on the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template and adding a custom template functor based on Cutlass's new “Epilogue Visitor Trees” (EVT) on top, which represents and performs the computation of the fused Pointwise / Elementwise computation nodes. This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu), which is currently the only documentation and example of Cutlass Epilogue Visitor Trees. This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform. A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments to each of the functor subexpressions. Step 1 functionality: * End to end code generation is possible using the above approach. * Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants ) after a matmul. * Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc. * Examples / Unit tests include ReLU and ReLU6 fusion. * Support for fp16 and fp16 with fp32 accumulation data types. * Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 ) The following is not yet supported, and is left for future work: * Full operation support ( e.g. full set of all ops usually handled via V.ops handlers ) * Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented ) * Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature * Add support for additional (auxiliary) outputs ( requires support for full computation graphs ) * Add support for reduction operations and operations which use different output layouts than the input * Add support for additional dtypes ( as far as Cutlass allows ) This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features for the inductor backend. See also Cutlass release notes: https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2 Notable changes in Cutlass 3.2.1 include: * Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which prevents namespace clashes without resorting to monkey-patching ( which was done earlier ). * Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet ) * Small API changes to the cutlass_library API ( requires adapting the inductor backend code ) Notable changes in Cutlass 3.2.2 include: * Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention Test Plan: * CI * pytest test/inductor/test_max_autotune.py Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled. Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890 Approved by: https://github.com/jansel ghstack dependencies: #112762	2023-11-06 19:42:10 +00:00
Ken Jin	674c104d12	Fix RecursionError in Inductor for large for loops (#112320 ) Fixes https://github.com/pytorch/pytorch/issues/111686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112320 Approved by: https://github.com/peterbell10	2023-11-05 13:12:54 +00:00
Jez Ng	ae85ba820f	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-11-02 07:39:13 +00:00
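The greedy-by-size idea described in the commit above can be illustrated with a toy planner. This is a minimal sketch under simplifying assumptions, not Inductor's implementation: sizes are static integers (no dynamic shapes or sympy expressions), allocations are kept in a flat list rather than a tree, and the `Buffer` class, the `plan` and `align` helpers, and the 64-byte alignment are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    size: int   # bytes
    start: int  # first step at which the buffer is live
    end: int    # last step at which the buffer is live (inclusive)


def align(nbytes, alignment=64):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def plan(buffers, alignment=64):
    """Greedy-by-size: place the largest buffers first, each at the lowest
    aligned offset that does not collide with a buffer live at the same time."""
    placed = []   # (offset, Buffer) pairs already assigned
    offsets = {}
    for buf in sorted(buffers, key=lambda b: b.size, reverse=True):
        # Only buffers whose live ranges overlap in time can conflict in space.
        conflicts = sorted(
            (off, off + align(other.size, alignment))
            for off, other in placed
            if buf.start <= other.end and other.start <= buf.end
        )
        offset = 0
        for lo, hi in conflicts:
            if offset + buf.size <= lo:
                break  # fits in the gap before this conflicting block
            offset = max(offset, hi)
        offsets[buf.name] = offset
        placed.append((offset, buf))
    pool_size = max(
        (off + align(b.size, alignment) for off, b in placed), default=0
    )
    return offsets, pool_size


buffers = [
    Buffer("a", 100, start=0, end=1),
    Buffer("b", 50, start=1, end=2),
    Buffer("c", 100, start=2, end=3),
]
offsets, pool_size = plan(buffers)
print(offsets, pool_size)
```

In this example `a` and `c` are never live at the same time, so they share offset 0, while `b` overlaps both and is placed after them. The whole pool can then be allocated once, up front.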
PyTorch MergeBot	74e6c877e9	Revert "[inductor] Memory planning (#112178 )" This reverts commit `f64a97c6f8`. Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too `f64a97c6f8` ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))	2023-11-01 00:03:56 +00:00
Jez Ng	f64a97c6f8	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-10-31 20:02:30 +00:00
Elias Ellison	6a99291546	Removing sdpa conv layout constraint (#112045 ) Previously layout opt with sdpa would cause failures because we would pass a non-dense last dim to sdpa. Those layout constraints have been added in prior prs. Now we can do conv layout opt with sdpa. Improves twins_pcpvt_base 1.4622 → 1.5351, xcit_large_24_p8_224 3.0681 → 3.1839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112045 Approved by: https://github.com/shunting314 ghstack dependencies: #111976, #111721	2023-10-27 05:40:43 +00:00
lezcano	47ccf04885	Split SymNode into its own file (#112037 ) This PR: - Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py` - Moves SymNode to `fx.experimental.sym_node`. - This file does not have any SymPy dependencies at import time - It installs the magic methods in Sym{Bool,Int,Float}. - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes - Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`. This breaks the import-time dependency between torch and SymPy Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037 Approved by: https://github.com/peterbell10 ghstack dependencies: #112035, #112036	2023-10-26 23:32:27 +00:00
Andrew Hu	8253e0524c	Add "device not supported" assert to inductor (#112001 ) Fixes #111999 Adds an assert that provides a more informative error message For example, when running a compiled function with mps (currently unsupported): ``` ... File "/Users/andrew.hu/Desktop/pytorch/torch/_inductor/graph.py", line 927, in init_wrapper_code assert wrapper_code_gen_cls is not None, f"Device {device_type} not supported" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: Device mps not supported ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112001 Approved by: https://github.com/peterbell10	2023-10-25 14:19:37 +00:00
Oguz Ulgen	977d3bcc46	[Inductor] Support user defined triton kernels in inductor (#111434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434 Approved by: https://github.com/jansel	2023-10-22 17:04:19 +00:00
Elias Ellison	0a147fd112	Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233 ) Improves perf of llama_v2 locally from 1.55 -> 1.57 The initial heuristic is to lower to pointwise if # of inputs is <= 4, and all the inputs are pointwise or cannot be memory planned away, or if all the outputs are pointwise. Perf run was +3% on inference.. There are definitely instances where we should be lowering to foreach_kernels, but it's less flexible for fusion. The motivating example was: ``` def rotate_half(x): """Rotates half the hidden dims of the input.""" x1 = x[..., : x.shape[-1] // 2] x2 = x[..., x.shape[-1] // 2 :] return torch.cat((-x2, x1), dim=-1) def apply_rotary_pos_emb(q, k, cos, sin): iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False) # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length) unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0) position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None # The first two dimensions of cos and sin are always 1, so we can `squeeze` them. cos = cos.squeeze(1).squeeze(0) # [seq_len, dim] sin = sin.squeeze(1).squeeze(0) # [seq_len, dim] cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim] sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim] q_embed = (q * cos) + (rotate_half(q) * sin) k_embed = (k * cos) + (rotate_half(k) * sin) return q_embed, k_embed ``` Also not sure if I should be more worried about concatting reduction->pointwise inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233 Approved by: https://github.com/Chillee	2023-10-21 02:34:05 +00:00
Jack Taylor	619ae87a1d	Disable inductor layout_opt on ROCm (#111474 ) Previously we disabled this option on non-MI200 GPUs (https://github.com/pytorch/pytorch/pull/107812) due to worse NHWC conv performance on some cards. This PR will disable this feature for all GPUs to make this uniform for ROCm and due to perf regressions noted here https://github.com/pytorch/pytorch/pull/110319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111474 Approved by: https://github.com/jithunnair-amd, https://github.com/eellison	2023-10-20 09:31:01 +00:00
Sherlock Huang	1aad6d803a	[Reland][Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567 ) (#111396 ) This is a reland of #110567 with additional fbcode fixed. Summary: In ABI compatible mode, We always need op_overload.schema for FallbackKernel. Approved by: https://github.com/jansel Test Plan: contbuild & OSS CI, see `37a0265992` Differential Revision: D50339346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111396 Approved by: https://github.com/chenyang78	2023-10-17 18:53:38 +00:00
Sam Larsen	0dfa354570	[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 ) Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR. Test Plan: * New unit tests exercising saving and load from the cache. * New unit tests to exercise the cache key calculations. * Ran several benchmarks to see cache hit and resulting compilation times. Differential Revision: [D50255289](https://our.internmc.facebook.com/intern/diff/D50255289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453 Approved by: https://github.com/eellison, https://github.com/Chillee	2023-10-13 13:33:56 +00:00
Oleg Khabinov	8209bbbd06	[AOTInductor] Improve validation for C++ wrapper codegen (#111102 ) It's a reimplementation of #111089 1. When using fake inputs make sure they are on the same device as the original inputs. 2. Don't change the value of self.cpp_wrapper from True to False if can't generate a C++ wrapper, instead have a check and fail early to avoid producing Python code for C++ compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111102 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w	2023-10-13 08:46:17 +00:00


223 Commits