Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.
Also there is some simplification between compilation call in subprocess and the one in main process
- previously we pass warm_cache_only=True if the compilation happens in subprocess. But triton never use that argument in the currently used pin. So I removed that
- previously we only pass compute_capability if compilation happens in subprocess. The PR change that to always passing compute_capability to triton.compile no matter if the compilation happens in main or sub process.
Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now requires a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affect the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run forbids stream argument now. Simply removing passing this in when benchmarking matmul triton kernel will work for both old and new version of triton.
- triton Autotuner change attribute name from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This cause dynamo failed to handle triton Autotuner object since dynamo TritonKernelVariable makes assumption about attribute names. It's used in some test cases that a model call triton Autotuner directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
Fixes#114310 and supersedes #114748.
There are two reasons why we have quite a few special cases for `round`:
1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
Currently the inductor code for `x.any(-1)` does a this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```
This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.
In the current code we also see:
```python
tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.
Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
- cpp test is skipped as failing and https://github.com/pytorch/pytorch/pull/115647 may fix it
This PR is fixing compilation error with the following code:
```python
import torch
def func(x, a):
n = (a * 1.234) // 8.234
y = x + n
return y
cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
^
AssertionError()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
This PR fixed#104791
bitcast requires the source and target have the bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher
precision values for the rest of the computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
Previously if two calls to cumsum were generated in the same triton kernel
we would generate identical helper functions with different names. Now this
recognizes identical functions and only defines it once. To do this I defer
choosing the name until after codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115546
Approved by: https://github.com/lezcano
ghstack dependencies: #109132
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.
Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.
Fixes https://github.com/pytorch/pytorch/issues/93631
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
This was invaluable when I was debugging #114917. Without the node names
in the log message, it was difficult to make sense of them.
However, I did not want to bloat the number of LOC with this change.
Thus, instead of calling `debug()` directly with the node arguments, I
made a new callable class WhyNoFuse to partially apply the node
arguments at the top of each fusion-checking method. WhyNoFuse generates
the logging string only when its `__str__` method gets called, so there
is minimal overhead when logging is disabled.
I also removed the various logging 'tags' like "vert:1" / "triton:1" --
the log messages themselves are unique enough that the user can identify
them without the tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115003
Approved by: https://github.com/Skylion007
Want to be a able to benchmark epilogue fused triton matmul kernel for a couple of reasons
1. @eellison found that certain TB models (resnet50, resnet152, moco) fails sometimes in maxautotune mode on the dashboard. The issue is quite hard to repro due to flakiness. The issue only get triggered when certain triton config for certain epilogue fused kernel get picked. (disable epilogue fusion bypass the issue) It would be nice if we can have a runnable script that directly run that kernel to ease further debugging
2. this is a necessary piece to do benchmark fusion for triton matmul kernels. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler for this
Example runnable kernel script: https://gist.github.com/shunting314/00bdbc1b6b46bfa73d1389d8f40cd669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114809
Approved by: https://github.com/eellison
torch.split(x, l) fails when l's shape is the unbacked symint.
E.g. l =
y.tolist() makes l the unbacked shape, because l depends on the
data access of y. The downdtream call `SliceView.create()`
evaluates the shape even if the input shape is unbacked symint,
which brings up the bug.
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
torch.split(x, l) fails when l's shape is the unbacked symint.
E.g. l =
y.tolist() makes l the unbacked shape, because l depends on the
data access of y. The downdtream call `SliceView.create()`
evaluates the shape even if the input shape is unbacked symint,
which brings up the bug.
Test Plan:
python test/inductor/test_unbacked_symints.py -k test_split_with_sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113406
Approved by: https://github.com/aakhundov, https://github.com/ezyang
Fixes#97361
When fused kernel more than 1024 parameters, it should throw error from ctypes.
Limit args number is should be a mechanism to protect stack memory. As we known, CPP is passing args via stack memory, and stack memory has size limitation.
Code change:
1. cpp backend will check the fused nodes' args number, if it is reach the limitation. It will status flush status to ready.
2. scheduler will check `ready_to_flush` API and help backend flush codegen.
3. Add `ready_to_flush` API to `BaseScheduling`, Triton backend will return False due to not support it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
Previously it lacked a type hint and so was treated as an Any type. This
resulted in a lot of untyped code downstream as V.graph is referenced in
many places in inductor code. I've typed it properly now as
GraphLowering, and fixed the numerous type errors this surfaced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114025
Approved by: https://github.com/eellison
ghstack dependencies: #114013
Fixes#97361
When fused kernel more than 1024 parameters, it should throw error from ctypes.
Limit args number is should be a mechanism to protect stack memory. As we known, CPP is passing args via stack memory, and stack memory has size limitation.
Code change:
1. cpp backend will check the fused nodes' args number, if it is reach the limitation. It will status flush status to ready.
2. scheduler will check `ready_to_flush` API and help backend flush codegen.
3. Add `ready_to_flush` API to `BaseScheduling`, Triton backend will return False due to not support it yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113131
Approved by: https://github.com/jgong5, https://github.com/mlazos
`install_config_module` makes a regular module into a ConfigModule with
extra methods defined on it. mypy thinks those extra methods (or module
functions) are undefined since it cannot analyze something so
dynamic. As a workaround, I've created a fake module that defines these
extra functions, which I import into the config modules during type
checking.
As part of this change, I've also added more types to config_utils.py
and enabled typechecking for torch/_dynamo/config.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112130
Approved by: https://github.com/jansel
This PR is spilt out of https://github.com/pytorch/pytorch/pull/108193 . It adds the ability to add assertion after each triton kernel calls to make sure all tensor arguments are not nan/inf. It helps me find a few bugs when working on benchmark fusion (due to messing up some kernel/graph level states when generating kernel code).
Right now we have to disable cudagraphs to enable the nan/inf checks. Otherwise we will see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is it's due to GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, it's probably still fine since it's just for debugging purpose.
Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112091
Approved by: https://github.com/eellison, https://github.com/jansel
triton_meta is intended to be passed directly to triton. Previous we were also putting other metadata into triton_meta; but we should split out the other metadata into a separate dict to avoid possible conficts in the future.
This PR splits out triton_meta and inductor_meta so we have a place to put additional metadata that isn't intended to be passed to triton.
Tests - wait for CI
Differential Revision: [D50864493](https://our.internmc.facebook.com/intern/diff/D50864493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112351
Approved by: https://github.com/eellison
In dynamo/inductor, sometimes it helps to gather metrics/statistics for each model in different levels like model level, graph level, kernel level or pair of fusion nodes level. This kind of thing will be very easy to do with Scuba, but we only have scuba in fbcode. This PR build metric tables to solve part of the problem.
Q: why not log to stdout/err direclty
A: sometimes we need more structured data. E.g., it would be helpful to gather all the stats in a CSV and then do post-processing (like calculating a geomean etc.). Also metric table will tag each row with the model name which is helpful.
Q: what's the difference with speedup_indcutor.csv
A: speedup_indcutor.csv is a special case that gather statistics on model level: i.e., we have one row for each model. But recording statistics on finer grain level like graph etc. is also helpful.
Example use cases:
- As a followup on the bechmark fusion PR, I want to gather all the 'slow' fusion and analyze them. With the metric table, I can easily log slow fusion for each model into a csv file. Here is the log gathered for huggingface:
https://gist.github.com/shunting314/964e73cc98368b301414ec7b7ad4c702 .
- To help understand the effect of 'loop ordering after fusion' PR, it would be helpful to gather stats like how many fusions happens for each graph. Previously we log the metric to stderr directly. But logging these metrics in a structural way is useful.
- gather number of registers, register spills, shared memory usage for each kernel in each model with runnable kernel code logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109245
Approved by: https://github.com/jansel, https://github.com/mlazos
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
return (a * b).sum(dim=-1)
N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```
This generates code like this: http://ix.io/4HFs
```
Original: 3.8 ms
This PR: 3.54 ms
Always `evict_first: 5.4ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
Fixes#109196
When we have a split reduction and the tensor is not an even multiple of the split size,
we use `ops.masked` to pad to an even multiple. In the case here we generated:
```python
tmp5 = tl.where(mask, tmp4, 0)
```
which implicitly promotes our boolean value to `int32`. The fix is to give the default
value the same dtype as `result`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109325
Approved by: https://github.com/lezcano