pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
eellison	a9134fa99a	Skip cudagraphs when there is sparsity (#113791 ) Fix for dlrm training Pull Request resolved: https://github.com/pytorch/pytorch/pull/113791 Approved by: https://github.com/Chillee	2023-11-17 01:36:03 +00:00
Wei Wei	b19cf868e8	Back out "Support fp8 in AOTInductor + support optional<> in C ABI (#112527 )" (#113747 ) Test Plan: sandcastle Differential Revision: D51330618 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113747 Approved by: https://github.com/chenyang78, https://github.com/khabinov	2023-11-15 22:42:22 +00:00
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
Edward Z. Yang	9752ef595c	[BE] Consistently use the sym_stride lowering, instead of short-circuiting before (#113071 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113071 Approved by: https://github.com/voznesenskym	2023-11-10 21:19:12 +00:00
Jez Ng	297c26bb8e	Support fp8 in AOTInductor + support optional<> in C ABI (#112527 ) This was originally ipiszy's PR: https://github.com/pytorch/pytorch/pull/112358 It turns out that we need to add support for optional types in order to support fp8 gemm (i.e. scaled_mm). Since our ABI-stable C interface can't support optional<> directly, I am passing in optional types via pointer instead. `AtenTensorHandle`s are already pointers, so nothing needs to change there. Only value types need to change. We decided on this approach instead of adding an extra `bool` param to the callee because this simplifies things. Having the same number of arguments regardless of whether we are emitting Python / C++ / ABI-compatible C++ makes codegen easier. There are a number of existing ABI-compatible functions that have optional-typed value parameters. Previously, they just assumed they would never be passed a `nullopt` / `None` at runtime. Changing them to use pointer types now would break ABI stability, so I have created an exclude list for those functions. Finally, I think the current implementation is kind of messy, and only works for FallbackKernels, even though technically ExternKernels could also have the same issue. It also doesn't support optional types nested in lists. I've left FIXME comments for both issues. Differential Revision: [D51084289](https://our.internmc.facebook.com/intern/diff/D51084289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112527 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2023-11-08 22:56:48 +00:00
Jason Ansel	3914566c73	[dynamo] Refactor OrderedDict to dict (#113234 ) In Python3 all dicts are ordered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113234 Approved by: https://github.com/oulgen, https://github.com/lezcano	2023-11-08 09:27:08 +00:00
Edward Z. Yang	10a829b85d	Retarget sym_size/sym_stride lowerings to their .int overloads (#113054 ) Fixes https://github.com/pytorch/pytorch/issues/112913 The new logging looks like this: ``` [2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg0_1 : [num_users=0] = placeholder[target=arg0_1] [2023-11-06 12:48:57,732] [0/0] torch._inductor.graph: [DEBUG] lowering %arg1_1 : [num_users=2] = placeholder[target=arg1_1] [2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] lowering %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, 1), kwargs = {}) [2023-11-06 12:48:57,733] [0/0] torch._inductor.graph: [DEBUG] via <function make_pointwise.<locals>.inner at 0x7f0abed28ee0> [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %sym_stride_int : [num_users=1] = call_function[target=torch.ops.aten.sym_stride.int](args = (%add, 0), kwargs = {}) sym_stride [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] lowering %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%arg1_1, %sym_stride_int), kwargs = {}) [2023-11-06 12:48:57,735] [0/0] torch._inductor.graph: [DEBUG] via <function mul at 0x7f0abec8bd00> [2023-11-06 12:48:57,744] [0/0] torch._inductor.graph: [DEBUG] lowering return (mul,) ``` Notice that `sym_stride` no longer is hitting the lowering. This is what the behavior was before I broke it. A better refactor coming soon. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113054 Approved by: https://github.com/davidberard98	2023-11-07 04:15:38 +00:00
Peter Bell	718035791d	Prefer `e.is_number` over `not e.free_symbols` in SymPy (#112688 ) We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times. Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is horribly slow. It turns out though that there is another propery `is_number` that does what we want. > property is_number: > > Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster > than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined > function. Even further, we also avoid the overhead of building the unnecessary set object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688 Approved by: https://github.com/lezcano	2023-11-06 20:05:13 +00:00
Kai Londenberg	bdfde62e54	[Inductor CUTLASS backend] Epilogue fusion codegen (Step 1) (#110890 ) Summary: This PR adds epilogue fusion code generation support for the new experimental [Inductor Cutlass backend]([https://github.com/pytorch/pytorch/pull/108015]). Details: A fusion happens on the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul Kernel template and adding a custom template functor based on Cutlass new “Epilogue Visitor Trees” (EVT) on top, which represents and performs the computation of the fused Pointwise / Elementwise computation nodes. This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu), which is currently the only documentation and example of Cutlass Epilogue Visitor Trees. This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform. A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments to each of the functor subexpressions. Step 1 functionality: * End to end code generation is possible using the above approach. * Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants ) after a matmul. * Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc. * Examples / Unit tests include ReLU and ReLU6 fusion. * Support for fp16 and fp16 with fp32 accumulation data types. * Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 ) The following is not yet supported, and is left for future work: * Full operation support ( e.g. full set of all ops usually handled via V.ops handlers ) * Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented ) * Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature * Add support for additional (auxiliary) outputs ( requires support for full computation graphs ) * Add support for reduction operations and operations which use different output layouts than the input * Add support for additional dtypes ( as far as Cutlass allows ) This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features for the inductor backend. See also Cutlass release notes: https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2 Notable changes in Cutlass 3.2.1 include: * Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which allows to prevent namespace clashes without resolving to monkey-patching ( which was done earlier ). * Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet ) * Small API changes to the cutlass_library API ( requires adapting the inductor backend code ) Notable changes in Cutlass 3.2.2 include: * Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention Test Plan: * CI * pytest test/inductor/test_max_autotune.py Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled. Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890 Approved by: https://github.com/jansel ghstack dependencies: #112762	2023-11-06 19:42:10 +00:00
Ken Jin	674c104d12	Fix RecursionError in Inductor for large for loops (#112320 ) Fixes https://github.com/pytorch/pytorch/issues/111686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112320 Approved by: https://github.com/peterbell10	2023-11-05 13:12:54 +00:00
Jez Ng	ae85ba820f	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-11-02 07:39:13 +00:00
PyTorch MergeBot	74e6c877e9	Revert "[inductor] Memory planning (#112178 )" This reverts commit `f64a97c6f8`. Reverted https://github.com/pytorch/pytorch/pull/112178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems that ROCm will need to be fixed for the new test too `f64a97c6f8` ([comment](https://github.com/pytorch/pytorch/pull/112178#issuecomment-1788195311))	2023-11-01 00:03:56 +00:00
Jez Ng	f64a97c6f8	[inductor] Memory planning (#112178 ) This was originally @jansel's PR: https://github.com/pytorch/pytorch/pull/102625, which I've built upon. This diff implements static memory planning. It's disabled by default while we examine its performance. We use a greedy-by-size approach. For dynamic shapes, the sizes of the example inputs are used as estimates when making planning decisions. We generate expressions to calculate the actual memory offsets and sizes at runtime when the values of the dynamic shapes are known. In order to simplify these calculations, we have organized the allocations into a tree that branches on space (address offsets) and time (live ranges). Finally, we need to align these offsets, so we have added an `align` sympy Expr to express these calculations. Some limitations: 1. It is only enabled during inference for now. Enabling it for training increases peak memory usage as we allocate all the memory needed for training upfront, before freeing the memory allocated during inference. We can probably address this by doing planning for both the inference and training passes together. 2. It doesn't work with PyTorch Distributed, because kernels like AllGatherIntoTensor codegen strings which do memory operations. We can fix this down the line by having them emit MemoryPlanningLines instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112178 Approved by: https://github.com/desertfire, https://github.com/jansel	2023-10-31 20:02:30 +00:00
Elias Ellison	6a99291546	Removing sdpa conv layout constraint (#112045 ) Previously layout opt with sdpa would cause failures because we would pass a non-dense last dim to sdpa. Those layout constraints have been added in prior prs. Now we can do conv layout opt with sdpa. Improves twins_pcpvt_base 1.4622 → 1.5351, xcit_large_24_p8_224 3.0681 → 3.1839 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112045 Approved by: https://github.com/shunting314 ghstack dependencies: #111976, #111721	2023-10-27 05:40:43 +00:00
lezcano	47ccf04885	Split SymNode into its own file (#112037 ) This PR: - Moves TrueDiv, LShift, RShift, IsNonOverlappingAndDenseIndicator to `_sympy.functions.py` - Moves SymNode to `fx.experimental.sym_node`. - This file does not have any SymPy dependencies at import time - It installs the magic methods in Sym{Bool,Int,Float}. - N.b. With this split, we may be able to move Sym{Bool,Int,Float} to this file, and remove quite a few of the hacks around these classes - Imports `sym_node` in `torch/__init__.py` rather than the whole `symbolic_shapes.py`. This breaks the import-time dependency between torch and SymPy Pull Request resolved: https://github.com/pytorch/pytorch/pull/112037 Approved by: https://github.com/peterbell10 ghstack dependencies: #112035, #112036	2023-10-26 23:32:27 +00:00
Andrew Hu	8253e0524c	Add "device not supported" assert to inductor (#112001 ) Fixes #111999 Adds an assert that provides a more informative error message For example, when running a compiled function with mps (currently unsupported): ``` ... File "/Users/andrew.hu/Desktop/pytorch/torch/_inductor/graph.py", line 927, in init_wrapper_code assert wrapper_code_gen_cls is not None, f"Device {device_type} not supported" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: Device mps not supported ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112001 Approved by: https://github.com/peterbell10	2023-10-25 14:19:37 +00:00
Oguz Ulgen	977d3bcc46	[Inductor] Support user defined triton kernels in inductor (#111434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111434 Approved by: https://github.com/jansel	2023-10-22 17:04:19 +00:00
Elias Ellison	0a147fd112	Pointwise fuse cat with pointwise inputs or outputs and <= 4 inputs (#111233 ) Improves perf of llama_v2 locally from 1.55 -> 1.57 The initial heuristic is to lower to pointwise if # of inputs is <= 4, and all the inputs are pointwise or cannot be memory planned away, or if all the outputs are pointwise. Perf run was +3% on inference.. There are definitely instances where we should be lowering to foreach_kernels, but it's less flexible for fusion. The motivating example was: ``` def rotate_half(x): """Rotates half the hidden dims of the input.""" x1 = x[..., : x.shape[-1] // 2] x2 = x[..., x.shape[-1] // 2 :] return torch.cat((-x2, x1), dim=-1) def apply_rotary_pos_emb(q, k, cos, sin): iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False) # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length) unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0) position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None # The first two dimensions of cos and sin are always 1, so we can `squeeze` them. cos = cos.squeeze(1).squeeze(0) # [seq_len, dim] sin = sin.squeeze(1).squeeze(0) # [seq_len, dim] cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim] sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim] q_embed = (q * cos) + (rotate_half(q) * sin) k_embed = (k * cos) + (rotate_half(k) * sin) return q_embed, k_embed ``` Also not sure if I should be more worried about concatting reduction->pointwise inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233 Approved by: https://github.com/Chillee	2023-10-21 02:34:05 +00:00
Jack Taylor	619ae87a1d	Disable inductor layout_opt on ROCm (#111474 ) Previously we disabled this option on none MI200 GPUs (https://github.com/pytorch/pytorch/pull/107812 due to worse NHWC conv performance on some cards. This PR will disable this feature for all GPUs to make this uniform for ROCm and due to perf regressions noted here https://github.com/pytorch/pytorch/pull/110319 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111474 Approved by: https://github.com/jithunnair-amd, https://github.com/eellison	2023-10-20 09:31:01 +00:00
Sherlock Huang	1aad6d803a	[Reland][Inductor] Disallow OpOverloadPacket in ir.FallbackKernel (#110567 ) (#111396 ) This is a reland of #110567 with additional fbcode fixed. Summary: In ABI compatible mode, We always need op_overload.schema for FallbackKernel. Approved by: https://github.com/jansel Test Plan: contbuild & OSS CI, see `37a0265992` Differential Revision: D50339346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111396 Approved by: https://github.com/chenyang78	2023-10-17 18:53:38 +00:00
Sam Larsen	0dfa354570	[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 ) Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR. Test Plan: * New unit tests exercising saving and load from the cache. * New unit tests to exercise the cache key calculations. * Ran several benchmarks to see cache hit and resulting compilation times. Differential Revision: [D50255289](https://our.internmc.facebook.com/intern/diff/D50255289) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453 Approved by: https://github.com/eellison, https://github.com/Chillee	2023-10-13 13:33:56 +00:00
Oleg Khabinov	8209bbbd06	[AOTInductor] Improve validation for C++ wrapper codegen (#111102 ) It's a reimplementation of #111089 1. When using fake inputs make sure they are on the same device as the original inputs. 2. Don't change the value of self.cpp_wrapper from True to False if can't generate a C++ wrapper, instead have a check and fail early to avoid producing Python code for C++ compiler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111102 Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w	2023-10-13 08:46:17 +00:00
PyTorch MergeBot	7fbfa4e020	Revert "[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 )" This reverts commit `fc1105b282`. Reverted https://github.com/pytorch/pytorch/pull/103453 on behalf of https://github.com/kit1980 due to Same issue unfortunately, the newly added test fails on internal builds ([comment](https://github.com/pytorch/pytorch/pull/103453#issuecomment-1760202365))	2023-10-12 18:54:51 +00:00
Sam Larsen	fc1105b282	[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 ) Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR. Test Plan: * New unit tests exercising saving and load from the cache. * New unit tests to exercise the cache key calculations. * Ran several benchmarks to see cache hit and resulting compilation times. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453 Approved by: https://github.com/eellison, https://github.com/Chillee	2023-10-11 14:39:14 +00:00
PyTorch MergeBot	3100d3e661	Revert "[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 )" This reverts commit `8a8668e1ae`. Reverted https://github.com/pytorch/pytorch/pull/103453 on behalf of https://github.com/kit1980 due to The newly added test fails on internal builds ([comment](https://github.com/pytorch/pytorch/pull/103453#issuecomment-1756449919))	2023-10-10 23:21:59 +00:00
Sam Larsen	8a8668e1ae	[inductor] Implement Fx graph caching to improve warm compilation time. (#103453 ) Summary: Implement an on-disk cache to save and reuse compiled FX Graphs. This implementation does not handle tensors with symbolic shapes. This needs to be done in a follow-up PR. Test Plan: * New unit tests exercising saving and load from the cache. * New unit tests to exercise the cache key calculations. * Ran several benchmarks to see cache hit and resulting compilation times. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103453 Approved by: https://github.com/eellison	2023-10-08 20:32:15 +00:00
Yang Chen	432df71820	[inductor] added a config to always add tensor constants (#110491 ) Summary: In some scenarios, we want to update constants at runtime. In such cases, we have to keep the original constants in the generated code without applying any constant-inlining optimizations. This PR adds a config to force us to add tensor constants. Differential Revision: D49895154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110491 Approved by: https://github.com/mikekgfb	2023-10-07 07:51:54 +00:00
Jez Ng	c77dd684c9	Enable typechecking in _inductor/ir.py (#110112 ) I used a bunch of ignore-type comments, mostly due to https://github.com/pytorch/pytorch/issues/109963. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110112 Approved by: https://github.com/peterbell10	2023-10-07 04:19:38 +00:00
Oguz Ulgen	e8ef8bfdce	[Inductor] Allow matmul to have flexiable layout when we are not autotuning (#110726 ) Fixes #102804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110726 Approved by: https://github.com/Chillee	2023-10-07 04:08:37 +00:00
Kazuaki Ishizaki	434a996c42	Fix typo under torch/_inductor directory (#110530 ) This PR fixes typo of comments and messages in files under `torch/_dynamo` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530 Approved by: https://github.com/kit1980	2023-10-05 02:17:20 +00:00
Bin Bao	8e14e76c34	[inductor] Enhance an input type assertion msg (#110176 ) Summary: to address https://github.com/pytorch/pytorch/issues/110089 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110176 Approved by: https://github.com/angelayi	2023-09-28 13:35:11 +00:00
angelayi	c71a64ccce	[aotinductor] Rename if name is prefixed with integer (#110113 ) Fixes https://github.com/pytorch/pytorch/issues/109894. Since in c++ we cannot have variables that start with an integer, we can do some additional handling in inductor to not produce constant tensors with names starting with integers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110113 Approved by: https://github.com/desertfire	2023-09-28 07:26:28 +00:00
Edward Z. Yang	abd9b763ca	[RFC] Add debug log as we lower each FX node (#109602 ) I found this useful for orienting myself when I threw an error mid-lowering. What do other people think? Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109602 Approved by: https://github.com/malfet, https://github.com/voznesenskym	2023-09-22 03:10:22 +00:00
willfengg	772e104dfd	[inductor] visualize fused ops in svg graph (#107752 ) example usage * `TORCH_COMPILE_DEBUG=1 INDUCTOR_ORIG_FX_SVG=1 INDUCTOR_POST_FUSION_SVG=1 python trig.py`: show original fx node name, file, and code. see snapshot 2 where we have origin_0, 1, 2 * trig.py can be found in P816304818 Implementation * keep original fx graph in GraphLowering, ```self.orig_gm: torch.fx.GraphModule = gm.__copy__()``` * draw original fx graph with origins ir_post_fusion ```V.debug.draw_orig_fx_graph(self.orig_gm, self.scheduler.nodes)```. node.meta["buff_meta"] tracks buf_name <img width="350" alt="Screenshot 2023-08-29 at 12 40 24 PM" src="https://github.com/pytorch/pytorch/assets/134637289/c4e197cb-ab3b-4a09-a584-c1356376accb"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/107752 Approved by: https://github.com/mlazos	2023-09-21 08:03:05 +00:00
pytorchbot	faa5985dfe	Fix issue when input/output buffer of functional collective (e.g. allreduce / allgather) is incorrectly reused later (#108811 ) For this program: ```python def func(a, *, tag, ranks, group_size): ar = torch.ops.c10d_functional.all_reduce(a, "sum", tag, ranks, group_size) ar = torch.ops.c10d_functional.wait_tensor(ar) c = torch.relu(a) # c = a d = torch.matmul(c, c) e = d + ar return (e,) ``` the generated code is: ```python def call(args): arg0_1, = args args.clear() assert_size_stride(arg0_1, (4, 4), (4, 1)) with torch.cuda._DeviceGuard(1): torch.cuda.set_device(1) # no-op to ensure context buf0 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32) buf0.copy_(arg0_1) #no reuse buf1_pg = c10d._find_or_create_pg_by_ranks_and_tag('', [0, 1], 2) buf1 = buf0 buf1_work = dist.all_reduce(buf1, async_op=True, group=buf1_pg, op=fun_col_impl._str_to_reduce_op('sum')) fun_col_impl._register_tensor_work(buf1, buf1_work) del buf1 buf0 = _wait_tensor(buf0) buf2 = buf0 buf3 = buf0; del buf0 # reuse # Source Nodes: [relu], Original ATen: [aten.relu] stream1 = get_cuda_stream(1) triton_poi_fused_relu_0.run(arg0_1, buf3, 16, grid=grid(16), stream=stream1) del arg0_1 buf4 = empty_strided((4, 4), (4, 1), device='cuda', dtype=torch.float32) # Source Nodes: [add, relu], Original ATen: [aten.add, aten.relu] extern_kernels.addmm(buf2, buf3, buf3, alpha=1, beta=1, out=buf4) return (buf4, ) ``` We can notice that allreduce input (`buf1` which is alias of `buf0`) is incorrectly reused as input (`buf3`) to the triton `triton_poi_fused_relu_0` inplace kernel, diverging from eager mode logic. In general, we should make it so that Inductor doesn't try to reuse the input buffer to an inplace functional collective. We have a similar problem for output buffer of out-of-place functional collectives, see https://github.com/pytorch/pytorch/issues/108780#issuecomment-1714921994. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108811 Approved by: https://github.com/Chillee, https://github.com/wconstab	2023-09-13 21:39:37 +00:00
Sam Larsen	264f1e7b4c	[inductor] Enable Mypy Checking for torch/_inductor/codecache.py (#108789 ) Summary: Add type annotations to torch/_inductor/codecache.py and enable mypy checking Test Plan: `lintrunner torch/_inductor/*.py` `python test/inductor/test_max_autotune.py` `python test/inductor/test_aot_inductor.py` `python test/inductor/test_torchinductor.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108789 Approved by: https://github.com/Skylion007, https://github.com/eellison	2023-09-13 14:05:35 +00:00
Mu-Chu Lee	30a33b76b9	[AOTInductor] Include constants in AOTInductor .so file. (#108473 ) Summary: Include constants in AOTInductor .so file. Added some difference: 1) serialize with ctypes instead of the native of torch.storage 2) Use the underlying for_blob instead of from_blob to construct Tensor. Test Plan: Unit tests: ``` test/inductor/test_aot_inductor.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/108473 Approved by: https://github.com/angelayi	2023-09-08 03:49:53 +00:00
Sherlock Huang	b9dfdc091b	[AOTInductor][Reland] Proxy Executor for Extern Fallback kernels (#107279 ) (#108350 ) Summary: This is a prototype for running extern fallback kernels with a host side proxy executor. Sample of generated cpp wrapper call: ``` at::Tensor buf0; // output buffer void* tensor_args_var_0[] = {&arg0_1, &arg0_1, &arg1_1, &arg0_1, &arg1_1, &buf0}; int64_t int_args_var_1[] = {81, 81, 7, 7, 7, 81}; proxy_executor->call_function("buf0", int_args_var_1, tensor_args_var_0); ``` - In my current implementation, proxy executor interprets the raw pointers according to the ops schema. This assumes that custom op MUST have a valid schema registered to Dispatcher. (I would like to validate this assumption) - I am using callboxed() API of the custom kernels. This is inevitable, as we wish to have a single call_function API for all possible custom kernels. - These are all the input argument types I have support so far. union Argument { # Bool value does not matter 1: bool asNone; 2: TensorArgument asTensor; 3: list<TensorArgument> asTensors; 5: i64 asInt; 7: list<i64> asInts; 8: double asFloat; 9: list<double> asFloats; 10: string asString; 10.5: list<string> asStrings; 11: SymIntArgument asSymInt; 12: list<SymIntArgument> asSymInts; 13: ScalarType asScalarType; 14: MemoryFormat asMemoryFormat; 15: Layout asLayout; 16: Device asDevice; 17: bool asBool; 18: list<bool> asBools; } - Need a policy for handling unpopulated argument with default values. Here are the options, and it has BC implications. 1. requires exported fx graph to explicitly populate default values, if users doesn't specify. 2. requires cpp wrapper to explicitly populate default values, if fx graph doesn't specify. 3. Proxy executor look up from opSchema for default values. For fixing T162112344 Test Plan: frontend: buck2 run mode/dev-sand mode/inplace -c fbcode.enable_gpu_sections=True sigmoid/frontend:export_main test: buck2 run mode/dev-sand //deeplearning/aot_inductor/test:test_custom_ops backend: buck2 run mode/dev-nosan //deeplearning/aot_inductor/fb:main buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark -- --exact 'caffe2/torch/fb/model_transform/experimental/benchmark/test:test_aot_inductor_benchmark - test_aot_inductor_benchmark_cmf30x (caffe2.torch.fb.model_transform.experimental.benchmark.test.test_aot_inductor_benchmark.AOTInductorBenchmark)' Reviewed By: suo Differential Revision: D48747417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/108350 Approved by: https://github.com/izaitsevfb	2023-09-02 17:14:10 +00:00
Bin Bao	06d74e6b24	Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349 ) This reverts commit `c3239442a3` due to internal test failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108349 Approved by: https://github.com/aakhundov, https://github.com/zhxchen17	2023-08-31 16:26:02 +00:00
Shunting Zhang	7cb4bf675b	[inductor] no-side-effect codegen (#107617 ) Inductor kernel codegen previously have the following side effect: - in `Kernel.__exit__ `, we add local used buffers in graph.removed_buffers - during codegen, we do memory allocation/free. These cause doing multiple versions of codegen for the same kernel hard. The PR refactor the code to make kernel codegen not changing graph level states. After codegening a kernel, the graph level state is not changed so we can go on to codegen another version of the kernel if we want. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107617 Approved by: https://github.com/jansel	2023-08-31 00:25:17 +00:00
chilli	39130c7433	Add reinplacing pass for scatters + incremental fake tensor updating (#106192 ) mutation for params) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106192 Approved by: https://github.com/jansel, https://github.com/eellison	2023-08-30 20:41:37 +00:00
Jason Ansel	2c87ef3dbf	[inductor] Fix inputs with existing offsets (#108168 ) This cherrypicks the reinterpret_tensor change from #102625 in order to fix a subtle correctness bug when the graph inputs already have a storage_offset set. The view change also fixes some issues with quantized models in torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108168 Approved by: https://github.com/desertfire	2023-08-29 23:47:03 +00:00
Mu-Chu Lee	c3239442a3	[AOTInductor] Include constants in AOTInductor .so file. (#107718 ) Summary: Include the constants into AOTInductor .so file. We do not modify existing API signatures but create necessary format with weight lifted out instead. Test Plan: test/inductor/test_aot_inductor.py Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/107718 Approved by: https://github.com/angelayi, https://github.com/eellison	2023-08-29 22:37:30 +00:00
Xia, Weiwen	f3adbab4bb	[Quant][Inductor] Enable quantization linear pattern fusion inside inductor (#106934 ) Summary Enable lowering of quantized linear in Inductor Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_unary Pull Request resolved: https://github.com/pytorch/pytorch/pull/106934 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison ghstack dependencies: #105818, #106781, #106782	2023-08-27 13:00:16 +00:00
leslie-fang-intel	1374974d60	[Quant][Inductor] Enable quantization conv_binary(add/add_relu) pattern fusion inside inductor (#105456 ) Summary Enable the `dequant-conv2d-binary_postop(add)-unary_postop(relu)-quant` pattern fusion and lowering inside inductor. Test Plan ``` clear && python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_binary ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/105456 Approved by: https://github.com/jgong5, https://github.com/eellison ghstack dependencies: #104580, #104581, #104588, #104590, #105455	2023-08-25 21:16:02 +00:00
leslie-fang-intel	46f63e283b	[Quant][Inductor] Enable quantization conv pattern fusion inside inductor (#104588 ) Summary Enable the `dequant-quantization-quant` pattern fusion and lowering inside inductor. Test Plan ``` python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_unary ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104588 Approved by: https://github.com/jgong5, https://github.com/eellison ghstack dependencies: #104580, #104581	2023-08-25 17:57:13 +00:00
Prachi Gupta	1ef4bd169d	[ROCm] Add conditions for channels last logic (#107812 ) Although there are some performance benefits by enforcing NHWC convolutions as inductor's fallback method for all hardware this may not be the case. Currently on ROCm we are seeing some slow downs in gcnArch that do not have optimal NHWC implementations and would like to introduce some control on this behavior in pytorch. On ROCm MI200 series we will default to the enforced last channels behavior aligned with the rest of pytorch but on non-MI200 series we will disable the forced layout. For now we are using torch.cuda.get_device_name(0) for this control but we will replace with gcnArchName when https://github.com/pytorch/pytorch/pull/107477 lands. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107812 Approved by: https://github.com/jataylo, https://github.com/eellison	2023-08-24 19:39:49 +00:00
Simon Fan	aca3d1433c	Estimate Scheduler node runtimes (#106426 ) Working as starter task with @Chillee This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds. We use a heuristic based approach, first by considering whether the operation is memory bandwidth bounded or compute bounded: - memory bandwidth bounded: we compute the number of bytes that are read/written to - compute bounded: we compute the FLOPS required by the operation One use case could be to be used as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762 ``` (pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests [(ExternKernelSchedulerNode(name='buf0'), 400)] [(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)] .[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)] [(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)] .[(ExternKernelSchedulerNode(name='buf0'), 3000)] [(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)] .[(ExternKernelSchedulerNode(name='buf0'), 34600)] [(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)] .[(ExternKernelSchedulerNode(name='buf0'), 396)] [(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)] .[(ExternKernelSchedulerNode(name='buf0'), 396)] [(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)] .[(ExternKernelSchedulerNode(name='buf0'), 7776176)] [(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)] .[(FusedSchedulerNode(nodes=buf0_buf1), 210)] [(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)] .[(ExternKernelSchedulerNode(name='buf0'), 300)] [(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)] .[(SchedulerNode(name='buf0'), 20)] [(SchedulerNode(name='buf0'), 4.7913701587934585e-11)] . ---------------------------------------------------------------------- Ran 10 tests in 14.311s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426 Approved by: https://github.com/Chillee	2023-08-17 17:23:30 +00:00
Wang, Eikan	9921b48558	Extend Inductor to support the third-party backend (#106874 ) ## Summary This is re-land PR for https://github.com/pytorch/pytorch/pull/100706 to address the compilation latency performance regression. ## Root Cause Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()` to check vectorization ISA is a time-consuming and one-shot operation. It leads to taking a longer time to import `codegen.cpp` package because the `LoopLevel` of the package is decorated by `@dataclasses.dataclass` while the decorator will invoke `codecache.pick_vec_isa()` to initialize the `simd_nelements` of the `LoopLevel`. `c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)` In terms of the Triton backend, it does not need to touch it. But we'd prefer to uniform the code. Therefore, the new design simultaneously registers `CpuScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. It will bring additional overhead to the Triton backend. ```python def init_backend_registration(self): if get_scheduling_for_device("cpu") is None: from .codegen.cpp import CppScheduling register_backend_for_device("cpu", CppScheduling, WrapperCodeGen) if get_scheduling_for_device("cuda") is None: from .codegen.triton import TritonScheduling register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen) ``` ## Solution To resolve the compilation latency regression for the Triton backend, we changed the `LoopLevel` a little bit([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` to `__post_init__` and the compilation performance would be back. ## Compilation Latency Performance Result We ran a single model benchmark and reproduced the compilation regression: - Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart` - W/ PR #100706, the compilation latency is about 57~58 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7 ``` - W/O PR #100706, the compilation latency is about 46~47 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7 ``` This PR fixed the compilation performance regression. - W/ this PR #106874, the compilation latency is about 47~48 ``` dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7 cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874 Approved by: https://github.com/jansel	2023-08-16 04:11:36 +00:00
Yanbo Liang	1819fe1324	Revert "Extend Inductor to support the third-party backend (#100706 )" (#106652 ) This reverts commit `05bd24bb35`. It caused compilation time regression on torchbench, huggingface and dynamic models. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652 Approved by: https://github.com/davidberard98, https://github.com/voznesenskym	2023-08-05 06:41:08 +00:00

1 2 3 4

195 Commits