Improves perf of llama_v2 locally from 1.55 -> 1.57
The initial heuristic is to lower to pointwise if the number of inputs is <= 4 and all the inputs are pointwise or cannot be memory-planned away, or if all the outputs are pointwise.
Perf run was +3% on inference. There are definitely instances where we should be lowering to foreach kernels, but they are less flexible for fusion. The motivating example was:
```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    iota = torch.ops.prims.iota.default(512, start = 0, step = 1, dtype = torch.int64, device = device(type='cuda', index=0), requires_grad = False)
    # File: /scratch/eellison/work/torchdynamo/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py:657, code: position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    unsqueeze = torch.ops.aten.unsqueeze.default(iota, 0)
    position_ids = torch.ops.aten.reshape.default(unsqueeze, [-1, 512]); unsqueeze = None
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
Also, I'm not sure if I should be more worried about concatenating reduction -> pointwise inputs.
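For illustration, a hypothetical sketch of the heuristic described above (`is_pointwise` and `can_be_memory_planned_away` are made-up names, not the actual Inductor code):
```python
def should_lower_cat_to_pointwise(inputs, outputs) -> bool:
    # Lower to pointwise if there are few inputs and each input is either
    # pointwise itself or cannot be memory-planned away anyway ...
    if len(inputs) <= 4 and all(
        inp.is_pointwise() or not inp.can_be_memory_planned_away() for inp in inputs
    ):
        return True
    # ... or if all outputs are pointwise, so the cat can fuse into them.
    return all(out.is_pointwise() for out in outputs)
```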
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111233
Approved by: https://github.com/Chillee
Did some easy fixes from enabling TRY200. Most of these look like oversights rather than intentional omissions. The proper way to silence intentional cases is with `from None`, which signals that you considered whether the exception should carry its cause and decided against it.
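An illustrative example of the pattern TRY200 flags (hypothetical code, not taken from this PR):
```python
def load_config(path: str) -> str:
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        # TRY200 flags a bare `raise RuntimeError(...)` here, because it
        # silently drops the original cause. Keep the cause explicitly:
        raise RuntimeError(f"could not load config from {path}") from e
        # Or, if deliberately suppressing the cause, say so:
        # raise RuntimeError(f"could not load config from {path}") from None
```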
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111496
Approved by: https://github.com/malfet
This PR implements an intra-graph communication reordering pass on the Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Moves compute nodes following simple heuristics to improve overlap.
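A toy sketch of the greedy reordering idea on a flat node list (the node and dependency representation here is hypothetical, not the Inductor scheduler IR):
```python
def sink_waits(order, deps):
    """Move each 'wait' node as late as possible, i.e. just before its first user."""
    order = list(order)
    for n in [m for m in order if m.startswith("wait")]:
        order.remove(n)
        users = [i for i, m in enumerate(order) if n in deps.get(m, ())]
        order.insert(min(users) if users else len(order), n)
    return order

def raise_comms(order, deps):
    """Move each 'comm' node as early as possible, i.e. just after its last input."""
    order = list(order)
    for n in [m for m in order if m.startswith("comm")]:
        order.remove(n)
        producers = [i for i, m in enumerate(order) if m in deps.get(n, ())]
        order.insert(max(producers) + 1 if producers else 0, n)
    return order

# deps maps each node to the nodes it depends on.
deps = {"wait0": {"comm0"}, "mm1": {"wait0", "mm0"}}
order = ["comm0", "wait0", "mm0", "mm1"]
print(raise_comms(sink_waits(order, deps), deps))
# ['comm0', 'mm0', 'wait0', 'mm1'] -- the independent compute now overlaps comm0.
```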
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Fixes #111066, #111065, #111064
Currently `use_cutlass_template` returns True on ROCm even though the feature is not supported there. Fix it to return False on ROCm. I considered adding this change to `try_import_cutlass` instead, but the comments hinted that that function would be removed at some point.
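A minimal sketch of the described fix (illustrative only, not the exact Inductor code):
```python
import torch

def use_cutlass_template_sketch() -> bool:
    # CUTLASS templates are not supported on ROCm builds.
    if torch.version.hip is not None:
        return False
    # ... existing checks (CUTLASS availability, config flags, etc.) ...
    return True
```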
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111132
Approved by: https://github.com/jansel
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues when the inference platform does not have GPUs.
This diff removes the CUDA dependency for the cpp backend.
Reviewed By: bertmaher, muchulee8, mikekgfb
Differential Revision: D49800712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
Example of when the `evict_first` heuristic helps.
```python
import torch
from torch._inductor.utils import do_bench

@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copy over the descriptions:
This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack.
Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_
        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]); l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        add = l_x_ + l_y_; l_x_ = l_y_ = None
        return add

class GraphModule(torch.nn.Module):
    def forward(self, l_pred2_, l_x_, l_y_):
        cond_true_0 = self.cond_true_0
        cond_false_0 = self.cond_false_0
        cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]); l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
        return cond

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        sin = l_x_.sin(); l_x_ = None
        cos = l_y_.cos(); l_y_ = None
        sub = sin - cos; sin = cos = None
        return sub

class GraphModule(torch.nn.Module):
    def forward(self, l_x_, l_y_):
        cos = l_x_.cos(); l_x_ = None
        sin = l_y_.sin(); l_y_ = None
        sub = cos - sin; cos = sin = None
        return sub
```
The source_fn for the inner cond, sin, cos, and sub nodes will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```
After this PR, the source_fn_stack will be a list of (name, target) tuples, with the bottom of the stack at the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```
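A hedged sketch of how the new metadata could be inspected on the graphs recorded in the example above, assuming `EagerAndRecordGraphs` exposes the captured graphs via a `graphs` attribute and that `source_fn_stack` is stored under that key in `node.meta`:
```python
for gm in backend.graphs:
    for node in gm.graph.nodes:
        stack = node.meta.get("source_fn_stack")
        if stack is not None:
            # Each entry is a (name, target) tuple; outermost scope first.
            print(node.name, stack)
```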
Test Plan:
See the added tests in test_higher_order_ops.py and the modified existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
# Motivation
@jansel As discussed before, we would like to generalize some CUDA-specific code. This can make Inductor friendlier to third-party backends so that they can leverage Inductor code as much as possible.
# Solution
To implement this, we introduce a device runtime abstraction. We wrap the runtime APIs inside `DeviceInterface` and use `register_interface_for_device` to register each kind of device with Inductor, then use `get_interface_for_device` to fetch the corresponding runtime for a device type. Usage looks like this:
```python
device_interface = get_interface_for_device("xpu")
device_interface.is_available()   # check whether XPU is available
device_interface.device_count()   # check how many XPU devices are available
```
`DeviceInterface` is a simple abstraction that enables third-party backends implementing CUDA-like semantics to integrate with Inductor. It keeps third-party backends from having to monkey-patch utility functions such as `decode_device`, which is otherwise hard-coded for CUDA.
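A hedged sketch of how a third-party backend might plug in. The class and function names come from the PR text; the import path and the XPU-specific calls inside the subclass are assumptions for illustration only:
```python
import torch
from torch._dynamo.device_interface import (  # import path is an assumption
    DeviceInterface,
    register_interface_for_device,
    get_interface_for_device,
)

class XPUInterface(DeviceInterface):
    @staticmethod
    def is_available() -> bool:
        # Assumes the backend exposes a torch.xpu module.
        return hasattr(torch, "xpu") and torch.xpu.is_available()

    @staticmethod
    def device_count() -> int:
        return torch.xpu.device_count() if XPUInterface.is_available() else 0

register_interface_for_device("xpu", XPUInterface)
assert get_interface_for_device("xpu") is XPUInterface
```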
# Additional Context
The main code changes:
- Make AsyncCompile device-agnostic so it can be leveraged by other backends.
- Make some utility functions device-agnostic to avoid monkey patches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109486
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/EikanWang
Summary:
Change AOTInductor to directly return output tensors instead of filling in pre-allocated output tensors passed by the caller. This gives several benefits:
* It makes sure AOTInductor has the same behavior when managing output tensors as the default Inductor, which is widely tested and thus more reliable.
* As we have debugged before, there are cases where we still have to codegen extra `copy_` ops to fill the pre-allocated output tensors, which hurts performance.
* With the coming enhanced memory planning, this also makes sure the memory-planning logic is the same between AOTInductor and Inductor, which will greatly simplify the problem and improve reliability.
This change also combines D49494954 from Yang and https://github.com/pytorch/pytorch/pull/109560 from Angela.
Differential Revision: D49502318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109790
Approved by: https://github.com/chenyang78
Summary: AOT inductor is only sort-of supported on CPU right now, but it works
with a few hacks (the .so needs to be compiled and run with CUDA present,
because we haven't excised the CUDA deps; also there's an `is_cpu` flag that
needs to be plumbed into the call, or else all the weights are erroneously
allocated on GPU).
But with those hacks in place, it currently works, so it's worth keeping the
current state working (and at some point we'll remove the hacks).
Test Plan:
```
python test_aot_inductor -k test_simple_cpu
```
Reviewers: binbao
Differential Revision: [D49427400](https://our.internmc.facebook.com/intern/diff/D49427400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109625
Approved by: https://github.com/mikekgfb, https://github.com/chenyang78, https://github.com/desertfire
Summary: This PR adds dynamic-shape support for AOTInductor
* On the runtime/interface side, we added two structs, StaticDimInfo
and DynamicDimInfo, to hold values for static and dynamic dimensions,
respectively. Dynamic dimensions are tracked by an unordered map field
defined in AOTInductorModelBase. At inference time, the inference run
method will assign the current real dimensional value to each dynamic
dimension before executing any kernel.
* On the CUDA wrapper codegen side, we generate dynamic symbols
appropriately for shape computations. We simulate kernel launch grids
in the C++ land by re-using the grid functions from the Python world.
The returned grid configs, which may contain symbolic expressions,
are printed out in their C++ forms via the CppPrinter. Note that
when dynamic shapes are involved, we have to compute grid configs
for each kernel at runtime in the same way as we do for launching
the corresponding Triton kernel. Otherwise, we may end up with
memory-access failures or mis-computations caused by invalid indices
for fetching or storing data in device memory.
Differential Revision: D49100472
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109012
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/hl475
Summary: Switch AOTInductor unit tests and integration tests to invoke the same runtime interface. This is only an effort to unify the usage of the runtime. The interface scrutiny will come in later PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108663
Approved by: https://github.com/ezyang
ghstack dependencies: #108653
Fixes https://github.com/pytorch/pytorch/issues/108323.
The cpp wrapper has a functionality regression on `llama` and `tnt_s_patch16_224` due to the recent support of scaled dot product flash attention in Inductor.
The schema of this OP is as follows:
```
- func: _scaled_dot_product_flash_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, int max_q, int max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
```
For `llama` and `tnt_s_patch16_224`, the op is called as shown below, where the three positional args with default values (`float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False`) are not passed:
```python
y = torch.ops.aten._scaled_dot_product_flash_attention.default(x0, x1, x2, scale = 0.125)
```
This PR fixes the cpp wrapper support for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108552
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
This replaces the `var_unnormalized` reduction type with `welford_reduce`, which takes the input data and outputs not just the variance but also the mean and weights that account for the full Welford accumulator state. Thus we avoid re-computing the mean, and we now have enough information to create a multi-layer reduction, which I implement here by adding a second reduction type called `welford_combine` that reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
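For reference, the standard parallel-combine formula that a `welford_combine` step corresponds to (a plain-Python sketch of the math, not Inductor's generated code):
```python
def welford_combine(mean_a, m2_a, w_a, mean_b, m2_b, w_b):
    # Merge two partial accumulators (mean, m2 = sum of squared deviations,
    # w = weight/count) into one, per Chan et al.'s parallel variance formula.
    w = w_a + w_b
    delta = mean_b - mean_a
    mean = mean_a + delta * w_b / w
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / w
    return mean, m2, w

# After the final combine, the variance is recovered from m2 and w
# (e.g. m2 / w for the population variance).
```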
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
Working on this as a starter task with @Chillee.
This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds.
We use a heuristic-based approach, first considering whether the operation is memory-bandwidth bound or compute bound:
- memory-bandwidth bound: we compute the number of bytes that are read/written
- compute bound: we compute the FLOPS required by the operation
One use case is as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762
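A rough roofline-style sketch of the heuristic in plain Python (a hypothetical helper, not the actual `BaseSchedulerNode` method; modeling the node by whichever estimate dominates is an assumption):
```python
def estimate_runtime_seconds(bytes_read_written, flops,
                             mem_bandwidth_bytes_per_s, peak_flops_per_s):
    # Memory-bandwidth-bound estimate: time to move the node's bytes.
    memory_time = bytes_read_written / mem_bandwidth_bytes_per_s
    # Compute-bound estimate: time to execute the node's FLOPs.
    compute_time = flops / peak_flops_per_s
    return max(memory_time, compute_time)
```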
```
(pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests
[(ExternKernelSchedulerNode(name='buf0'), 400)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)]
.[(ExternKernelSchedulerNode(name='buf0'), 34600)]
[(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 7776176)]
[(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)]
.[(FusedSchedulerNode(nodes=buf0_buf1), 210)]
[(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)]
.[(ExternKernelSchedulerNode(name='buf0'), 300)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(SchedulerNode(name='buf0'), 20)]
[(SchedulerNode(name='buf0'), 4.7913701587934585e-11)]
.
----------------------------------------------------------------------
Ran 10 tests in 14.311s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426
Approved by: https://github.com/Chillee
While working on loop ordering, I happened to find that Inductor may cache a stale inner_fn_str and ReadWrites object in a ComputedBuffer.
Let's say we have a producer buffer buf0 and a consumer buffer buf1. Before we call GraphLowering.finalize, the layout of buf0 may be a FlexibleLayout. At that moment, the inner_fn_str or ReadWrites object computed for buf1 will be based on the layout of buf0, which most likely is a contiguous FlexibleLayout, and they will be cached on the buf1 object (or buf1.data).
However, after we call GraphLowering.finalize, we may decide it's better to give buf0 a non-contiguous layout (e.g., if its input has a non-contiguous layout, or for some other reason). The layout change of buf0 should affect the inner_fn_str and ReadWrites object for buf1, but we may have already cached those on buf1. The stale ReadWrites object for buf1 may result in sub-optimal strides for buf1.
This may affect perf and I'll check the nightly runs.
Here is a dump of `nodes` in `Scheduler.__init__` before the fix as a reference: https://gist.github.com/shunting314/ed2152a08e268f5563fd55398b1392c7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106502
Approved by: https://github.com/jansel
Currently, when dynamic=True, TritonTemplates won't be used, as the condition `if list(call_args) != expected_args` defined in `TritonTemplate` cannot be satisfied. This PR fixes the issue by allowing symbolic variable names to be passed via `extra_args` and replacing all symbolic values in the generated TritonTemplate code with call_arg names.
With this change, a locally compiled mm + epilogue node calls into the Triton kernel successfully.
This PR also introduces a new config, `max_autotune_gemm_backends`, for specifying the candidate GEMM backends for max autotune. Current choices: combinations of ATEN and TRITON. This makes testing easier, since we can explicitly test Triton GEMM kernels + epilogue fusions + dynamic shapes without falling back to ATen ops.
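A hypothetical usage sketch of the new config knob (the comma-separated value format is assumed):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "TRITON"  # e.g. "ATEN,TRITON"

@torch.compile(dynamic=True)
def mm_relu(a, b):
    return torch.relu(a @ b)
```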
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105295
Approved by: https://github.com/jansel
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph. This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.
However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling backwards graph immediately
if we capture any SymInts for backwards.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```
For the cpp wrapper, when `kwargs` is not empty for an `OpOverloadPacket` kernel, we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel: this includes finding the correct order of the kwargs and getting the default values for optional args that were not provided at the call site (`layout` in the above case).
The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.
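An illustrative sketch (not the cpp_wrapper implementation itself) of inspecting the overloads of an `OpOverloadPacket` and their schemas, which is the information needed to order kwargs and fill in defaults such as `layout`:
```python
import torch

packet = torch.ops.aten.randn
for overload_name in packet.overloads():
    schema = getattr(packet, overload_name)._schema
    print(overload_name, [arg.name for arg in schema.arguments])
```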
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing
into direct indexing expressions. I also add support to the cpp printer for
Min/Max and fix the triton printer to support multi-argument Min/Max.
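A hypothetical example of the kind of pattern this helps with: indices bounded via `torch.clamp` (which lowers to minimum/maximum) feeding an indirect load. Whether this exact snippet exercises the new path is an assumption.
```python
import torch

@torch.compile
def gather_clamped(x, idx):
    # clamp lowers to minimum/maximum, which can now participate in the
    # hoisted indexing expression for the indirect load below.
    idx = torch.clamp(idx, 0, x.shape[0] - 1)
    return x[idx]
```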
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano