This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this information to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
class attributes in favour of local variables.
This class is now accessible from any `LoopBody` object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/100830.
For an inplace node, a `copy_` is generated, and the `copy_` is `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after fusing with the previous buffer it is no longer a memory-copy-only buffer.
This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, it enables fp32/bf16 vec conversion in `to_dtype`, so we always store bf16.
```python
import torch
import torch.nn as nn
torch.manual_seed(420)
from torch._inductor import config

x = torch.randn(1, 18, dtype=torch.bfloat16)

class ExampleModel(nn.Module):
    def __init__(self):
        super(ExampleModel, self).__init__()
        self.relu = nn.ReLU(inplace=True)  # nn.ReLU(inplace=False)

    def forward(self, input1):
        out = self.relu(input1)
        # input1.copy_(out)
        return out

func = ExampleModel()

with torch.no_grad():
    func.train(False)
    res1 = func(x)  # without jit
    print(res1)
    jit_func = torch.compile(func)
    res2 = jit_func(x)
    print(res2)
```
Generated code without this PR (the `tmp3` store is wrong: `tmp3` is `float` while `out_ptr1` is `bf16`):
```
auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0));
auto tmp1 = (tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = (tmp2);
store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
Generated code with this PR:
```
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16);
auto tmp1 = cvt_bf16_to_fp32(tmp0);
auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
auto tmp3 = cvt_fp32_to_bf16(tmp2);
tmp3.store(out_ptr0 + static_cast<long>(i0), 16);
tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```
This PR also fixes the data type propagation for `masked_subblock`.
Previously, the masked_subblock's dtype was propagated from its inputs, which is wrong.
```
opcode       name              target            args            kwargs
-----------  ----------------  ----------------  --------------  --------
call_module  masked_subblock1  masked_subblock1  (and__2, -inf)
```
Now we propagate it from the subblock with the same name:
```
# graph for body.subblocks['masked_subblock1']
opcode       name       target     args                        kwargs
-----------  ---------  ---------  --------------------------  --------
placeholder  ops        ops        ()                          {}
call_module  get_index  get_index  ('index2',)                 {}
call_method  load       load       (ops, 'arg0_1', get_index)  {}
call_method  to_dtype   to_dtype   (ops, load, torch.float32)  {}
output       output     output     (to_dtype,)                 {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042
Approved by: https://github.com/jgong5, https://github.com/jansel
Currently, if we have an inplaced buffer that is completely internal to a fused kernel and thus doesn't need to be allocated, we still allocate it and pass an unused argument to the kernel, because our buffer-removal analysis treats it separately (assuming that either the original or the mutated value is still needed).
This PR extends buffer removal to inplaced buffers that can be removed.
Generated kernel for e.g. ln changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
`in_out_ptr1` is also mislabeled: it's not really `in_out`, it is only written to, but this PR doesn't fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
## Issue description
The PR https://github.com/pytorch/pytorch/pull/100064 introduces a new RNG operation process. However, it causes every `randint` to load a separate random seed by default. TorchInductor generates a buffer to store all necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the `ir_pre_fusion` output generated by TorchInductor, some buffers differ by only one line: the load of the random seed with the corresponding offset. The codegen then generates Triton kernels following the same rule. Finally, in `output_code.py`, some Triton kernels differ by only one line, meaning redundant kernels are being generated.
## Solution
This PR captures the seed offset and adds it to the existing `self.sizevars` structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument.
This PR reduces the number of generated kernels from 50 to 17 for BertForMaskedLM forward.
According to tests on my own environment, the compilation time of attention_is_all_you_need_pytorch has been reduced from 94s to 66s. The speedup remains largely unchanged, at 1.37X.
The following is a comparison for a simple example.
Before:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 0)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
triton_poi_fused_1 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + 1)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
...''')
def call(args):
triton_poi_fused_0.run(buf0, buf1, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_1.run(buf0, buf2, 1024, grid=grid(1024), stream=stream0)
```
After:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + load_seed_offset)
tmp1 = x0
tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
....
def call(args):
triton_poi_fused_0.run(buf0, buf1, 0, 1024, grid=grid(1024), stream=stream0)
triton_poi_fused_0.run(buf0, buf2, 1, 1024, grid=grid(1024), stream=stream0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102104
Approved by: https://github.com/jansel, https://github.com/ngimel
This wraps `ops` into an `OpsWrapper` object which wraps any returned
IR values into an `OpsValue` instance. This allows magic methods to
be implemented and means lowerings can write mathematical expressions much more
fluently. So instead of
```python
ops.add(ops.mul(ops.mul(ops.sub(ops.mul(_Ap2, x), _Ap3), x), x), _1)
```
we can write
```python
(_Ap2 * x - _Ap3) * x * x + _1
```
And it will translate to the equivalent `ops` calls.
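A minimal, self-contained sketch of the mechanism (simplified, with an illustrative stand-in handler rather than the actual inductor implementation): `OpsWrapper` wraps the handler so every returned IR value becomes an `OpsValue`, and `OpsValue`'s magic methods turn Python operators back into `ops.*` calls.
```python
class OpsValue:
    """Wraps an IR value so Python operators forward to ops.* calls."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return ops.add(self, other)

    def __sub__(self, other):
        return ops.sub(self, other)

    def __mul__(self, other):
        return ops.mul(self, other)


class OpsWrapper:
    """Wraps a handler so every returned IR value becomes an OpsValue."""
    def __init__(self, handler):
        self._handler = handler

    def __getattr__(self, name):
        def inner(*args):
            # Unwrap OpsValue arguments, call the real handler, re-wrap the result.
            unwrapped = [a.value if isinstance(a, OpsValue) else a for a in args]
            return OpsValue(getattr(self._handler, name)(*unwrapped))
        return inner


class StringHandler:
    """Illustrative stand-in handler that formats each op as a string."""
    def __getattr__(self, name):
        return lambda *args: f"{name}({', '.join(map(str, args))})"


ops = OpsWrapper(StringHandler())

_Ap2, _Ap3, _1, x = map(OpsValue, ["_Ap2", "_Ap3", "_1", "x"])
print(((_Ap2 * x - _Ap3) * x * x + _1).value)
# add(mul(mul(sub(mul(_Ap2, x), _Ap3), x), x), _1)
```
Running the snippet prints the same nested call string as the explicit `ops.add(...)` version above.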
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101076
Approved by: https://github.com/lezcano, https://github.com/ngimel
Fixes #100831, fixes #100878
Previously `gen_assert_indirect_indexing` was only called on the index
expressions passed to `ops.load` and `ops.store`, which means that if the
variable is optimized out during lowering, we never generate the
assert. This PR instead makes `ops.indirect_indexing` eagerly generate
the assert statement, whether or not the result will be used.
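A rough, hedged sketch of the behaviour (the `KernelSketch` class and its methods are illustrative stand-ins, not inductor's real codegen classes): emitting the assert inside `indirect_indexing` means it survives even if no load or store ever consumes the index.
```python
class KernelSketch:
    """Illustrative stand-in for the codegen state; collects emitted lines."""
    def __init__(self):
        self.lines = []

    def indirect_indexing(self, var, size):
        # Eager: the bounds assert is emitted here, so it survives even if the
        # index is optimized out during lowering and never reaches a load/store.
        self.lines.append(
            f'tl.device_assert((0 <= {var}) & ({var} < {size}), '
            f'"index out of bounds: 0 <= {var} < {size}")'
        )
        return var

    def load(self, buf, index):
        self.lines.append(f"{buf}_val = tl.load({buf} + ({index}))")
        return f"{buf}_val"


k = KernelSketch()
idx = k.indirect_indexing("tmp0", "ks3")
# Even if this load were later optimized away, the assert above would remain.
k.load("in_ptr1", f"x0 + ks1*ks2*{idx}")
print("\n".join(k.lines))
```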
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100895
Approved by: https://github.com/lezcano, https://github.com/ngimel
**Summary**
Since the current quantization flow has not decomposed quant/dequant into prim ops, in this PR:
- We enable the quant/dequant decomposition as a lowering inside inductor (a sketch of the underlying math follows this list).
- For the `decomposed.quant/dequant.tensor` overloads, scalar tensors for `zero point` and `scale` are loaded, so we need to enable vec codegen for these op overloads.
- Minor change: add `is_load_uint8_as_float` and `is_store_float_as_uint8` with default value `False` to `OptimizationContext`.
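For reference, a sketch of the elementwise math such a lowering boils down to, using standard affine quantization formulas (the helper names below are illustrative, not the `decomposed.*` ops themselves):
```python
import torch

def dequantize_per_tensor_sketch(x_uint8, scale, zero_point):
    # Affine dequantization: (q - zero_point) * scale in fp32.
    return (x_uint8.to(torch.float32) - zero_point) * scale

def quantize_per_tensor_sketch(x_fp32, scale, zero_point, qmin=0, qmax=255):
    # Affine quantization: round(x / scale) + zero_point, clamped to the uint8 range.
    q = torch.round(x_fp32 / scale) + zero_point
    return torch.clamp(q, qmin, qmax).to(torch.uint8)

x = torch.randint(0, 256, (16,), dtype=torch.uint8)
y = quantize_per_tensor_sketch(
    torch.relu(dequantize_per_tensor_sketch(x, scale=0.05, zero_point=128)),
    scale=0.05, zero_point=128,
)
```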
**TestPlan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_quant_lowering
```
Co-authored with @Xia-Weiwen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99131
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
This PR also adds a way to CSE statements (not only assignments).
The tests follow the pattern from https://github.com/openai/triton/pull/1143
They take a fair amount of time to run (90s on my box). If we wanted to
improve this, we could avoid testing the `ndim == 3` case.
Changes like this one make me hope that we get to clean up the number of
lowerings we have at some point...
Generated code for `x[y]` with `x.shape == (3, 2, 4), y.ndim == 1`:
With `dynamic=False`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < 3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < 3")
tmp1 = tl.load(in_ptr1 + (x0 + (8*tmp0)), xmask)
```
With `dynamic=True`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tmp1 = tl.load(in_ptr1 + (x0 + (ks1*ks2*tmp0)), xmask)
```
Generated code for `x[y+1, y+1]` with `x.shape == (3, 2, 4)`, `y.shape == (3, 3)`:
With `dynamic=False` (note how it folds the two upper bounds to `min(3, 2) == 2`):
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = 1
tmp2 = tmp0 + tmp1
tl.device_assert(((0 <= tmp2) & (tmp2 < 2)) | (~xmask), f"index out of bounds: 0 <= tmp2 < 2")
tmp3 = tl.load(in_ptr1 + (x0 + (12*tmp2)), xmask)
```
With `dynamic=True`:
```python
tl.device_assert(((0 <= tmp2) & (tmp2 < min(ks2, ks1))) | (~xmask), f"index out of bounds: 0 <= tmp2 < min(ks2, ks1)")
```
The same works when the CSE'd variable appears 3 or more times, but then it generates `min(ks0, min(ks1, ks2))`.
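A small illustrative sketch (not the actual inductor code) of how several upper bounds on the same CSE'd index can be folded into the nested `min` expression above, with concrete integers folded eagerly:
```python
def fold_upper_bounds(bounds):
    # Fold the upper bounds right-to-left into a nested min(...) expression,
    # collapsing pairs of concrete integers as we go.
    def fold(a, b):
        if isinstance(a, int) and isinstance(b, int):
            return min(a, b)
        return f"min({a}, {b})"

    result = bounds[-1]
    for b in reversed(bounds[:-1]):
        result = fold(b, result)
    return str(result)

print(fold_upper_bounds([3, 2]))                 # 2
print(fold_upper_bounds(["ks2", "ks1"]))         # min(ks2, ks1)
print(fold_upper_bounds(["ks0", "ks1", "ks2"]))  # min(ks0, min(ks1, ks2))
```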
Generated code for `x[y] = z` with `x.ndim == 3`, `y.ndim == 1` and dynamic shapes:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = tl.load(in_ptr1 + (x2), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tl.store(out_ptr0 + (x0 + (ks1*ks2*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
```
Fixes https://github.com/pytorch/pytorch/issues/93538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98590
Approved by: https://github.com/ngimel
This makes only a cosmetic change to the generated code, but means
triton's broadcasting logic doesn't leak out into the CSE class.
Before:
```python
tmp5_load = tl.load(in_ptr1 + (0))
tmp5 = tl.broadcast_to(tmp5_load, [XBLOCK])
```
After:
```python
tmp5 = tl.load(in_ptr1 + (0))
tmp6 = tl.broadcast_to(tmp5, [XBLOCK])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98304
Approved by: https://github.com/ngimel
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.
The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
Fixes https://github.com/pytorch/pytorch/issues/95469
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
This generates compilable code for maskrcnn graph 13, with ceilings hoisted to be computed on the host. But it now fails with
```
File "/scratch/ngimel/work/pytorch/torch/_dynamo/symbolic_convert.py", line 379, in wrapper
self.output.compile_subgraph(self, reason=reason)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/output_graph.py", line 562, in compile_subgraph
pass1.foreach(stack_values)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 166, in foreach
self(i)
File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 148, in __call__
output.extend(value.reconstruct(self))
File "/scratch/ngimel/work/pytorch/torch/_dynamo/variables/dicts.py", line 40, in reconstruct
codegen.create_load_python_module(collections),
TypeError: create_load_python_module() missing 1 required positional argument: 'push_null'
from user code:
File "/scratch/ngimel/work/env/lib/python3.9/site-packages/torchvision-0.15.0a0+928b05c-py3.9-linux-x86_64.egg/torchvision/models/detection/backbone_utils.py", line 58, in forward
x = self.fpn(x)
```
It looks like we never exercise this `create_load_python_module()` path for other subgraphs.
Any advice on how to fix this, @voznesenskym @jansel?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95690
Approved by: https://github.com/jansel
This PR is a new version of #89566, fixing a test failure.
I couldn't get ghstack to cooperate on updating that PR after re-opening,
so I started a new one.
This changes the way masks for loads/stores are computed in the triton backend of inductor.
The new approach is to iterate over all variables used in an indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.
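A hedged sketch of the idea (the helper name and structure are illustrative, not the actual inductor code): derive the mask set from the symbols appearing in the indexing expression, and fall back to per-variable mask sets for indirect indices.
```python
import sympy

def mask_vars_for(index_expr, indirect_masks=None):
    # Collect the mask variables implied by an indexing expression: x0 -> xmask,
    # y1 -> ymask, r3 -> rmask; indirect indices (e.g. tmp5) contribute the mask
    # set recorded when the corresponding CSE variable was created.
    masks = set()
    for sym in index_expr.free_symbols:
        name = sym.name
        if name[0] in "xyr" and name[1:].isdigit():
            masks.add(f"{name[0]}mask")
        elif indirect_masks and name in indirect_masks:
            masks |= indirect_masks[name]
    return masks

x0, r3, tmp5 = sympy.symbols("x0 r3 tmp5")
print(mask_vars_for(x0 + 16 * r3))                        # {'xmask', 'rmask'}
print(mask_vars_for(x0 + 8 * tmp5, {"tmp5": {"xmask"}}))  # {'xmask'}
```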
I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which was meanwhile fixed by #89524 with a different approach, making this change less necessary. Note, however, that #89524 fixes the issue by broadcasting the indices being loaded to a larger size, while this approach fixes it by making the mask contain only the necessary terms.
Relative to #89566, the only change is to not include the mask variables
of arguments when the function being called is `tl.where`. The reason
being that `tl.where` is often used precisely to make sure the output
variable has valid values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in the number of terms. Say, for
example, you have:
```python
def fn(a):
    for _ in range(3):
        a = ops.mul(a, a)
    return a
```
You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```
This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward`, which in pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.
This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```
This corresponds to the sequence of `ops` calls made.
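A self-contained sketch of the difference (simplified stand-ins, not the actual inductor handlers): nesting formatted strings duplicates shared subexpressions, while assigning each `ops` result to a temporary keeps the output linear in the number of ops.
```python
import itertools

class NestedStringOps:
    """Stand-in for the old behaviour: ops format directly into nested strings."""
    def load(self, buf, idx):
        return f"load({buf}, {idx})"
    def mul(self, a, b):
        return f"{a} * {b}"

class AssigningOps:
    """Stand-in for the new behaviour: every op result is assigned to a temp."""
    def __init__(self):
        self.lines = []
        self._counter = itertools.count()
    def _assign(self, expr):
        var = f"tmp{next(self._counter)}"
        self.lines.append(f"{var} = {expr}")
        return var
    def load(self, buf, idx):
        return self._assign(f"load({buf}, {idx})")
    def mul(self, a, b):
        return self._assign(f"{a} * {b}")

def fn(ops, a):
    # Same computation as the example above, with the handler passed explicitly.
    for _ in range(3):
        a = ops.mul(a, a)
    return a

nested = NestedStringOps()
print(fn(nested, nested.load("arg_0", "i0")))   # 8 loads, 7 multiplies

assigning = AssigningOps()
result = fn(assigning, assigning.load("arg_0", "i0"))
print("\n".join(assigning.lines) + f"\nreturn {result}")  # 1 load, 3 multiplies
```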
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
For reductions, the code strings in the codegen stage and the execution stage differ because of `\` line continuations.
- The code string obtained from `code.getvalue()` (`code` is an `IndentedBuffer`) in the codegen stage:
```
#pragma omp declare reduction(argmax : struct IndexValue_1 :\
omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
- The code string loaded during execution (the `\` will be escaped):
```
#pragma omp declare reduction(argmax : struct IndexValue_1 : omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value, omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index) initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
```
Thus we can't get the same hash value for these two pieces of code.
This PR adds a function that applies the same backslash-escaping transformation in the codegen stage, so both stages produce the same string (and hash).
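A hedged sketch of the idea (the function name and the exact normalization below are illustrative, not the helper added in this PR): normalize the backslash line continuations on the codegen-side string so both stages hash the same text.
```python
import hashlib
import re

def normalize_line_continuations(code: str) -> str:
    # Hypothetical helper: collapse a backslash-newline continuation (and the
    # following indentation) into a single space, mimicking how the pragma
    # reads back as one line at execution time.
    return re.sub(r"\\\n\s*", " ", code)

codegen_str = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 :\\\n"
    "    omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value)"
)
executed_str = (
    "#pragma omp declare reduction(argmax : struct IndexValue_1 : "
    "omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value)"
)

h1 = hashlib.sha256(normalize_line_continuations(codegen_str).encode()).hexdigest()
h2 = hashlib.sha256(executed_str.encode()).hexdigest()
assert h1 == h2  # after normalization both stages hash to the same value
```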
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire