This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I need to be able to block fusions of split-scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is identical but lets the scheduler tell the two apart.
The split-scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky because
the exact scratch-space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum RBLOCK size and always allocating
for the maximum possible grid size for a given input tensor.
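A minimal sketch of the sizing strategy (illustrative only; `MIN_RBLOCK` and the helper name are hypothetical, not the PR's actual code): bound the grid from the input shape by assuming the smallest allowed reduction block.
```
import math

MIN_RBLOCK = 256  # hypothetical lower bound on the reduction block size

def max_workspace_elems(reduction_numel: int, batch_numel: int) -> int:
    # With RBLOCK >= MIN_RBLOCK the grid can contain at most this many blocks
    # along the reduction dimension, so one scratch slot per (batch element,
    # reduction block) is enough for any RBLOCK the autotuner later picks.
    max_rsplit = math.ceil(reduction_numel / MIN_RBLOCK)
    return batch_numel * max_rsplit
```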
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
Fixes https://github.com/pytorch/pytorch/issues/118129
Suppressions automatically added with
```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. Most call sites use these methods for set membership, which is deterministic, but set iteration is non-deterministic. Therefore, in the one place where we iterate over unbacked symbols, we sort by symbol name before iterating to preserve determinism.
Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
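A minimal sketch of the pattern (the variable here is hypothetical; in the PR the sort happens at the single site that iterates over the symbols):
```
import sympy

# Hypothetical set of unbacked symbols collected from a node.
unbacked_syms = {sympy.Symbol("u2"), sympy.Symbol("u0"), sympy.Symbol("u1")}

# Set membership is deterministic, but iteration order is not, so sort by
# symbol name before iterating wherever the order can affect the output.
for sym in sorted(unbacked_syms, key=lambda s: s.name):
    print(sym)
```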
Fixes #113130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
This PR fixes two bugs:
1) Constant folding a triton kernel results in the kernel's inputs being returned back without any modification, so constant folding is disabled for triton kernels. Needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
Summary:
An internal MRS model was taking over a day's worth of time to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended while it is being iterated over, which means we need to keep list semantics.
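A minimal sketch of the idea, not the PR's actual class (names are illustrative): an append-only list that ignores duplicates while preserving list semantics, so it can keep growing during an index-based iteration.
```
class DedupList:
    """Append-only list that ignores duplicate items (items must be hashable)."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


deps = DedupList()
deps.append("buf0")
i = 0
while i < len(deps):      # list semantics: items appended below extend the loop
    deps.append("buf1")   # duplicate appends are silently dropped
    i += 1
```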
Test Plan: ad hoc testing
Differential Revision: D52060659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
Summary: When the model takes no inputs, AOTInductor relies on checking weights to figure out which device to compile the model into. Currently recording buffer device type happens too late, and this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114682
Approved by: https://github.com/chenyang78
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.
However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
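A minimal sketch of the pattern (simplified; the real class lives in the Inductor scheduler and its message format may differ):
```
import logging

fusion_log = logging.getLogger("fusion")


class WhyNoFuse:
    def __init__(self, node1, node2):
        # Partially apply the node arguments once, at the top of the
        # fusion-checking method.
        self.node1 = node1
        self.node2 = node2
        self.reason = ""
        self.args = ()

    def __call__(self, reason, *args):
        self.reason = reason
        self.args = args
        # "%s" defers formatting: __str__ only runs if the record is emitted.
        fusion_log.debug("%s", self)

    def __str__(self):
        return f"cannot fuse {self.node1} with {self.node2}: " + self.reason % self.args
```
In a fusion check this would be instantiated once, e.g. `why = WhyNoFuse(node1, node2)`, and each early-return path would call `why("some reason %s", detail)`.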
I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
This reduces compile time of Adam on 1k parameters from 180s to 140s (28%), the main reason being that thousands of buffers no longer get sent to the scheduler.
The idea behind this is that if a destination buffer (from a copy_) has no users, it shouldn't matter if dst aliases src.
This is implemented by reinplacing copy_ nodes when safe.
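A rough, user-level illustration of where such `copy_` epilogues show up (the reinplacing itself is a compiler pass, not user code; the optimizer-style function below is just an example):
```
import torch

def sgd_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 0.01):
    # Functionalization turns the in-place update into "compute a new buffer,
    # then copy_ it back into param". If that destination has no other users,
    # the copy can be re-inplaced so the update writes directly into param.
    param.add_(grad, alpha=-lr)

compiled_step = torch.compile(sgd_step)
compiled_step(torch.zeros(1000), torch.ones(1000))
```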
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112440
Approved by: https://github.com/jansel
Fixes #97361
When a fused kernel has more than 1024 parameters, ctypes throws an error. Limiting the number of args is meant to protect stack memory: as we know, C++ passes args via the stack, and stack memory has a size limit.
Code change:
1. The cpp backend checks the fused nodes' arg count; once it reaches the limit, the backend flushes its status to ready.
2. The scheduler checks the `ready_to_flush` API and helps the backend flush codegen.
3. Add a `ready_to_flush` API to `BaseScheduling`; the Triton backend returns False since it does not support this yet (see the sketch after this list).
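A minimal sketch of the shape of this hook, with simplified names (the real signatures in the Inductor scheduler/backends may differ):
```
MAX_FUSED_KERNEL_ARGS = 1024  # the limit from the linked issue


class BaseScheduling:
    def ready_to_flush(self) -> bool:
        # Default: nothing forces an early flush (e.g. the Triton backend).
        return False


class CppScheduling(BaseScheduling):
    def __init__(self):
        self.pending_arg_count = 0

    def record_fused_node_args(self, num_args: int) -> None:
        self.pending_arg_count += num_args

    def ready_to_flush(self) -> bool:
        # Ask the scheduler to flush codegen before C++ arg passing would
        # blow past the stack-friendly limit.
        return self.pending_arg_count >= MAX_FUSED_KERNEL_ARGS
```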
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
This needs some special handling because we don't actually allocate
boolean symbols in sympy; we allocate an integer indicator variable.
See comment for more details.
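A minimal sketch of the indicator-variable idea (illustrative; the real naming and plumbing live in the symbolic shapes code and the comment referenced above):
```
import sympy

# Instead of a boolean symbol, allocate a 0/1 integer indicator and
# express the boolean as "indicator == 1".
indicator = sympy.Symbol("u0", integer=True, nonnegative=True)
as_bool = sympy.Eq(indicator, 1)

print(as_bool.subs(indicator, 1))  # True
print(as_bool.subs(indicator, 0))  # False
```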
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114157
Approved by: https://github.com/ydwu4
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
Summary:
This PR adds epilogue fusion code generation support for the new experimental
[Inductor Cutlass backend](https://github.com/pytorch/pytorch/pull/108015).
Details:
A fusion happens at the GEMM template level by taking a Cutlass 3.x GEMM Universal Matmul kernel template
and adding a custom template functor on top, based on Cutlass' new "Epilogue Visitor Trees" (EVT), which represents and
performs the computation of the fused Pointwise / Elementwise computation nodes.
This is the approach dictated by [NVIDIA/cutlass example 49](https://github.com/NVIDIA/cutlass/blob/main/examples/49_hopper_gemm_with_collective_builder/49_collective_builder.cu),
which is currently the only documentation and example of Cutlass Epilogue Visitor Trees.
This EVT functor in turn is a hierarchical template expression which represents an abstract syntax tree of the fused computation to perform.
A second codegen task is to create a hierarchical initializer expression, which provides potentially necessary arguments
to each of the functor subexpressions.
Step 1 functionality:
* End to end code generation is possible using the above approach.
* Supports simple elementwise expression fusion of chains of elementwise operations (with scalar constants )
after a matmul.
* Elementwise operation support includes addition, subtraction, multiplication, division, minimum, maximum etc.
* Examples / Unit tests include ReLU and ReLU6 fusion (see the usage sketch after this list).
* Support for fp16 and fp16 with fp32 accumulation data types.
* Generates SM90 ( Hopper ) based CUDA Kernels ( as Cutlass up to 3.2.0 only supported EVT for SM90 )
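For orientation, a hedged usage-level sketch of the kind of graph this targets: a matmul followed by a chain of elementwise ops with scalar constants (ReLU6). Enabling the backend is assumed here to go through the `max_autotune_gemm_backends` Inductor config knob; the exact knobs and accepted values may differ by version, and the backend remains disabled by default as noted below.
```
import torch
import torch._inductor.config as inductor_config

# Assumed opt-in knob for the experimental CUTLASS GEMM backend.
inductor_config.max_autotune_gemm_backends = "CUTLASS"

def gemm_relu6(a, b):
    # Matmul plus a chain of elementwise ops with scalar constants: the
    # epilogue shape the EVT-based codegen is meant to fuse.
    return torch.clamp(a @ b, min=0.0, max=6.0)

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
out = torch.compile(gemm_relu6, mode="max-autotune")(a, b)
```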
The following is not yet supported, and is left for future work:
* Full operation support ( e.g. full set of all ops usually handled via V.ops handlers )
* Cutlass EVT with SM80 support ( possible in Cutlass 3.2.1 according to release notes, but not yet documented )
* Add support for additional (auxiliary) inputs, which changes the Template Kernels' call signature
* Add support for additional (auxiliary) outputs ( requires support for full computation graphs )
* Add support for reduction operations and operations which use different output layouts than the input
* Add support for additional dtypes ( as far as Cutlass allows )
This PR updates third_party/cutlass to v3.2.2, which has some important improvements and features
for the inductor backend.
See also Cutlass release notes:
https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1 and https://github.com/NVIDIA/cutlass/releases/tag/v3.2.2
Notable changes in Cutlass 3.2.1 include:
* Cutlass codegen python code has moved into a package with the "cutlass_library" namespace, which makes it possible to
prevent namespace clashes without resorting to monkey-patching (which was done earlier).
* Support for SM80 epilogue visitor trees ( according to the Release Notes, not tried yet )
* Small API changes to the cutlass_library API ( requires adapting the inductor backend code )
Notable changes in Cutlass 3.2.2 include:
* Bugfix that led to CUDA Illegal memory access in some Pytorch unit tests involving flash attention
Test Plan:
* CI
* pytest test/inductor/test_max_autotune.py
Note: So far, the CUTLASS backend is still disabled by default. Benchmarks are planned once more advanced fusions are enabled.
Differential Revision: [D50988161](https://our.internmc.facebook.com/intern/diff/D50988161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110890
Approved by: https://github.com/jansel
ghstack dependencies: #112762
In dynamo/inductor, it sometimes helps to gather metrics/statistics for each model at different levels: model level, graph level, kernel level, or pairs of fusion nodes. This kind of thing would be very easy to do with Scuba, but we only have Scuba in fbcode. This PR builds metric tables to solve part of the problem.
Q: why not log to stdout/err directly?
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean, etc.). Also, the metric table tags each row with the model name, which is helpful.
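For instance, a minimal sketch of the kind of post-processing this enables (file and column names below are hypothetical):
```
import csv
import math

# Hypothetical per-graph metrics CSV emitted by a metric table.
with open("fusion_metrics.csv") as f:
    speedups = [float(row["speedup"]) for row in csv.DictReader(f)]

geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"geomean speedup over {len(speedups)} rows: {geomean:.3f}")
```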
Q: what's the difference with speedup_inductor.csv?
A: speedup_inductor.csv is a special case that gathers statistics at the model level, i.e. one row for each model. But recording statistics at a finer-grained level, like per graph, is also helpful.
Example use cases:
- As a followup on the benchmark fusion PR, I want to gather all the 'slow' fusions and analyze them. With the metric table, I can easily log slow fusions for each model into a csv file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of the 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happen for each graph. Previously we logged the metric to stderr directly, but logging these metrics in a structured way is more useful.
- Gather the number of registers, register spills, and shared memory usage for each kernel in each model, with runnable kernel code logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
~~Shape is assumed by `TensorMetadata` to be torch.Shape/tuple, however, some of the scheduler node groups utilize `int`, so convert to tuple.~~
Root cause is actually `foreach` scheduler node having silent-error group of int, when in fact it ought to be opaque `foreach`.
**Previously:** silent error / confusing shape of (0,)

**Now:** clear that it is foreach which does not have well-defined shape:

~~Alternate might be to create list of shapes for each of its subnodes. Actually, for debuggability sake, I may prefer this. We can ensure that the recursive generation of this string is only done dynamically in a debug code path. Else, incrementally computing it on initialization of ForeachKernel may also be feasible.~~ This is quite infeasible for 100s of params.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110336
Approved by: https://github.com/mlazos
Summary: Modify the result of get_estimated_runtime() for ExternKernelSchedulerNode to count both bytes and FLOPs and return the maximum of the two.
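A back-of-the-envelope sketch of the resulting roofline-style estimate (the hardware constants are placeholders, not values from the PR):
```
MEM_BW_BYTES_PER_S = 2.0e12   # placeholder device memory bandwidth
COMPUTE_FLOPS_PER_S = 3.0e14  # placeholder device compute throughput

def estimated_runtime_s(num_bytes: float, num_flops: float) -> float:
    # The extern kernel is assumed bound by whichever takes longer:
    # moving its bytes or doing its FLOPs.
    return max(num_bytes / MEM_BW_BYTES_PER_S,
               num_flops / COMPUTE_FLOPS_PER_S)
```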
Reviewed By: xmfan
Differential Revision: D48987490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110539
Approved by: https://github.com/xw285cornell
This PR implements intra-graph communication reordering pass on Inductor scheduler IR, based on Horace's previous PR #100762.
Main algorithm:
1. Greedily moves waits as late as possible (i.e. until we reach a use)
2. Greedily moves comms as early as possible (i.e. until we reach an input)
3. Move computes following simple heuristics to improve overlap (see the conceptual sketch after this list).
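A conceptual sketch of the comm-raising step over a flat schedule (node methods like `is_comm()` and `inputs()` are hypothetical simplifications of the scheduler IR, not the PR's actual API):
```
def raise_comms(order):
    # Greedily move each communication node as early as possible, i.e. to
    # just after the latest node that produces one of its inputs.
    order = list(order)
    for node in list(order):
        if not node.is_comm():
            continue
        idx = order.index(node)
        last_dep_idx = max(
            (order.index(dep) for dep in node.inputs() if dep in order),
            default=-1,
        )
        order.insert(last_dep_idx + 1, order.pop(idx))
    return order
```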
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108091
Approved by: https://github.com/Chillee, https://github.com/wanchaol
Summary: Unbacked SymInts can't get a `sizevars.size_hint` due to being data-dependent. #109893 has added a new `fallback` parameter to `sizevars.size_hint` to specify the fallback value in cases like unbacked SymInt. In this PR we add more of those.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110520
Approved by: https://github.com/jansel, https://github.com/ezyang