This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I need to be able to block fusions of split-scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is identical but lets the scheduler tell the two apart.
The split-scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky because
the exact scratch-space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum RBLOCK size and always allocating
for the maximum possible grid size for a given input tensor.
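A minimal sketch of the sizing strategy (illustrative only; `MIN_RBLOCK` and the helper name are hypothetical, not the PR's actual code): bound the grid from the input shape by assuming the smallest allowed reduction block.
```
import math

MIN_RBLOCK = 256  # hypothetical lower bound on the reduction block size

def max_workspace_elems(reduction_numel: int, batch_numel: int) -> int:
    # With RBLOCK >= MIN_RBLOCK the grid can contain at most this many blocks
    # along the reduction dimension, so one scratch slot per (batch element,
    # reduction block) is enough for any RBLOCK the autotuner later picks.
    max_rsplit = math.ceil(reduction_numel / MIN_RBLOCK)
    return batch_numel * max_rsplit
```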
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. Most call sites use these methods for set membership, which is deterministic, but set iteration is non-deterministic. Therefore, in the one place where we iterate over unbacked symbols, we sort by symbol name before iterating to preserve determinism.
Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
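A minimal sketch of the pattern (the variable here is hypothetical; in the PR the sort happens at the single site that iterates over the symbols):
```
import sympy

# Hypothetical set of unbacked symbols collected from a node.
unbacked_syms = {sympy.Symbol("u2"), sympy.Symbol("u0"), sympy.Symbol("u1")}

# Set membership is deterministic, but iteration order is not, so sort by
# symbol name before iterating wherever the order can affect the output.
for sym in sorted(unbacked_syms, key=lambda s: s.name):
    print(sym)
```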
Fixes #113130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
This PR fixes two bugs:
1) Constant folding a triton kernel results in the kernel's inputs being returned back without any modification, so constant folding is disabled for triton kernels. Needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Summary:
An internal MRS model was taking over a day's worth of time to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended while it is being iterated over, which means we need to keep list semantics.
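A minimal sketch of the idea, not the PR's actual class (names are illustrative): an append-only list that ignores duplicates while preserving list semantics, so it can keep growing during an index-based iteration.
```
class DedupList:
    """Append-only list that ignores duplicate items (items must be hashable)."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


deps = DedupList()
deps.append("buf0")
i = 0
while i < len(deps):      # list semantics: items appended below extend the loop
    deps.append("buf1")   # duplicate appends are silently dropped
    i += 1
```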
Test Plan: ad hoc testing
Differential Revision: D52060659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model into. Currently recording buffer device type happens too late, and this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.
However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
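A minimal sketch of the pattern (simplified; the real class lives in the Inductor scheduler and its message format may differ):
```
import logging

fusion_log = logging.getLogger("fusion")


class WhyNoFuse:
    def __init__(self, node1, node2):
        # Partially apply the node arguments once, at the top of the
        # fusion-checking method.
        self.node1 = node1
        self.node2 = node2
        self.reason = ""
        self.args = ()

    def __call__(self, reason, *args):
        self.reason = reason
        self.args = args
        # "%s" defers formatting: __str__ only runs if the record is emitted.
        fusion_log.debug("%s", self)

    def __str__(self):
        return f"cannot fuse {self.node1} with {self.node2}: " + self.reason % self.args
```
In a fusion check this would be instantiated once, e.g. `why = WhyNoFuse(node1, node2)`, and each early-return path would call `why("some reason %s", detail)`.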
I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
This reduces compile time of Adam on 1k parameters from 180s to 140s (28%), the main reason being that thousands of buffers no longer get sent to the scheduler.
The idea behind this is that if a destination buffer (from a copy_) has no users, it shouldn't matter if dst aliases src.
This is implemented by reinplacing copy_ nodes when safe.
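A rough, user-level illustration of where such `copy_` epilogues show up (the reinplacing itself is a compiler pass, not user code; the optimizer-style function below is just an example):
```
import torch

def sgd_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 0.01):
    # Functionalization turns the in-place update into "compute a new buffer,
    # then copy_ it back into param". If that destination has no other users,
    # the copy can be re-inplaced so the update writes directly into param.
    param.add_(grad, alpha=-lr)

compiled_step = torch.compile(sgd_step)
compiled_step(torch.zeros(1000), torch.ones(1000))
```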
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112440
Approved by: https://github.com/jansel
Fixes #97361
When a fused kernel has more than 1024 parameters, ctypes throws an error. Limiting the number of args is meant to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limit.
Code change:
1. The cpp backend checks the fused nodes' arg count; once it reaches the limit, the backend flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet (see the sketch after this list).
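A minimal sketch of the shape of this hook, with simplified names (the real signatures in the Inductor scheduler/backends may differ):
```
MAX_FUSED_KERNEL_ARGS = 1024  # the limit from the linked issue


class BaseScheduling:
    def ready_to_flush(self) -> bool:
        # Default: nothing forces an early flush (e.g. the Triton backend).
        return False


class CppScheduling(BaseScheduling):
    def __init__(self):
        self.pending_arg_count = 0

    def record_fused_node_args(self, num_args: int) -> None:
        self.pending_arg_count += num_args

    def ready_to_flush(self) -> bool:
        # Ask the scheduler to flush codegen before C++ arg passing would
        # blow past the stack-friendly limit.
        return self.pending_arg_count >= MAX_FUSED_KERNEL_ARGS
```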
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
This needs some special handling because we don't actually allocate
boolean symbols in sympy; we allocate an integer indicator variable.
See comment for more details.
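A minimal sketch of the indicator-variable idea (illustrative; the real naming and plumbing live in the symbolic shapes code and the comment referenced above):
```
import sympy

# Instead of a boolean symbol, allocate a 0/1 integer indicator and
# express the boolean as "indicator == 1".
indicator = sympy.Symbol("u0", integer=True, nonnegative=True)
as_bool = sympy.Eq(indicator, 1)

print(as_bool.subs(indicator, 1))  # True
print(as_bool.subs(indicator, 0))  # False
```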
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114157
Approved by: https://github.com/ydwu4
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
Summary:
This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015).
Details:
A fusion happens at the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul kernel template
and adding a custom template functor on top, based on Cutlass' new "Epilogue Visitor Trees" (EVT), which represents and
performs the computation of the fused Pointwise / Elementwise computation nodes.
This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.
This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.
Step 1 functionality:
* End to end code generation is possible using the above approach.
* Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants )
after a matmul.
* Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc.
* Examples / Unit tests include ReLU and ReLU6 fusion (see the usage sketch after this list).
* Support for fp16 and fp16 with fp32 accumulation data types.
* Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 )
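For orientation, a hedged usage-level sketch of the kind of graph this targets: a matmul followed by a chain of elementwise ops with scalar constants (ReLU6). Enabling the backend is assumed here to go through the `max_autotune_gemm_backends` Inductor config knob; the exact knobs and accepted values may differ by version, and the backend remains disabled by default as noted below.
```
import torch
import torch._inductor.config as inductor_config

# Assumed opt-in knob for the experimental CUTLASS GEMM backend.
inductor_config.max_autotune_gemm_backends = "CUTLASS"

def gemm_relu6(a, b):
    # Matmul plus a chain of elementwise ops with scalar constants: the
    # epilogue shape the EVT-based codegen is meant to fuse.
    return torch.clamp(a @ b, min=0.0, max=6.0)

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
out = torch.compile(gemm_relu6, mode="max-autotune")(a, b)
```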
The following is not yet supported, and is left for future work:
* Full operation support ( e.g. full set of all ops usually handled via V.ops handlers )
* Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented )
* Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
* Add support for additional (auxiliary) outputs ( requires support for full computation graphs )
* Add support for reduction operations and operations which use different output layouts than the input
* Add support for additional dtypes ( as far as Cutlass allows )
This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.
See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2
Notable changes in Cutlass 3.2.1 include:
* Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which makes it possible to
prevent namespace clashes without resorting to monkey-patching (which was done earlier).
* Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet )
* Small API changes to the cutlass_library API ( requires adapting the inductor backend code )
Notable changes in Cutlass 3.2.2 include:
* Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention
Test Plan:
* CI
* pytest test/inductor/test_max_autotune.py
Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.
Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762
In dynamo/inductor, it sometimes helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pairs of fusion nodes. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.
Q: why not log to stdout/err directly?
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean, etc.). Also, the metric table tags each row with the model name, which is helpful.
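For instance, a minimal sketch of the kind of post-processing this enables (file and column names below are hypothetical):
```
import csv
import math

# Hypothetical per-graph metrics CSV emitted by a metric table.
with open("fusion_metrics.csv") as f:
    speedups = [float(row["speedup"]) for row in csv.DictReader(f)]

geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geomean speedup over {len(speedups)} rows: {geomean:.3f}")
```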
Q: what's the difference with speedup_inductor.csv?
A: speedup_inductor.csv is a special case that gathers statistics at the model level, i.e. one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful.
Example use cases:
- As a followup on the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a csv file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged the metric to stderr directly, but logging these metrics in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with runnable kernel code logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
~~Shape is assumed by `TensorMetadata` to be torch.Shape/tuple, however, some of the scheduler node groups utilize `int`, so convert to tuple.~~
Root cause is actually `foreach` scheduler node having silent-error group of int, when in fact it ought to be opaque `foreach`.
**Previously:** silent error / confusing shape of (0,)

**Now:** clear that it is foreach which does not have well-defined shape:

~~Alternate might be to create list of shapes for each of its subnodes. Actually, for debuggability sake, I may prefer this. We can ensure that the recursive generation of this string is only done dynamically in a debug code path. Else, incrementally computing it on initialization of ForeachKernel may also be feasible.~~ This is quite infeasible for 100s of params.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110336
Approved by: https://github.com/mlazos
Summary: Modify the result of get_estimated_runtime() for ExternKernelSchedulerNode to count both bytes and FLOPs and return the maximum of the two.
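A back-of-the-envelope sketch of the resulting roofline-style estimate (the hardware constants are placeholders, not values from the PR):
```
MEM_BW_BYTES_PER_S = 2.0e12   # placeholder device memory bandwidth
COMPUTE_FLOPS_PER_S = 3.0e14  # placeholder device compute throughput

def estimated_runtime_s(num_bytes: float, num_flops: float) -> float:
    # The extern kernel is assumed bound by whichever takes longer:
    # moving its bytes or doing its FLOPs.
    return max(num_bytes / MEM_BW_BYTES_PER_S,
               num_flops / COMPUTE_FLOPS_PER_S)
```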
Reviewed By: xmfan
Differential Revision: D48987490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110539
Approved by: https://github.com/xw285cornell
This PR implements intra-graph communication reordering pass on Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap (see the conceptual sketch after this list).
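A conceptual sketch of the comm-raising step over a flat schedule (node methods like `is_comm()` and `inputs()` are hypothetical simplifications of the scheduler IR, not the PR's actual API):
```
def raise_comms(order):
    # Greedily move each communication node as early as possible, i.e. to
    # just after the latest node that produces one of its inputs.
    order = list(order)
    for node in list(order):
        if not node.is_comm():
            continue
        idx = order.index(node)
        last_dep_idx = max(
            (order.index(dep) for dep in node.inputs() if dep in order),
            default=-1,
        )
        order.insert(last_dep_idx + 1, order.pop(idx))
    return order
```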
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Summary: Unbacked SymInts can't get a `sizevars.size_hint` due to being data-dependent. #109893 has added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInt. In this PR we add more of those.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang