This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
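As a rough illustration of the two dispatch styles, here is a minimal sketch. The names `GetattrHandler`, `SketchDefaultHandler`, `StringHandler`, `OP_NAMES` and the `_default` hook are hypothetical and only mirror the idea described here; the real `DefaultHandler` may be structured differently.

```python
from typing import Any, Callable, Dict, Tuple

OP_NAMES = ("add", "mul", "neg")  # hypothetical op list, for illustration only


class GetattrHandler:
    """Old style: every op lookup falls through to __getattr__ and builds a closure."""

    def __getattr__(self, name: str) -> Callable[..., Any]:
        def inner(*args: Any, **kwargs: Any) -> str:
            return f"{name}({', '.join(map(str, args))})"

        return inner


class SketchDefaultHandler:
    """New style: real methods are installed once and forward to a single hook."""

    def _default(self, name: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> Any:
        raise NotImplementedError


def _make_forwarder(name: str) -> Callable[..., Any]:
    # Bind `name` per method so each forwarder reports its own op.
    def method(self: SketchDefaultHandler, *args: Any, **kwargs: Any) -> Any:
        return self._default(name, args, kwargs)

    method.__name__ = name
    return method


for _op in OP_NAMES:
    setattr(SketchDefaultHandler, _op, _make_forwarder(_op))


class StringHandler(SketchDefaultHandler):
    """Subclasses override only the hook, not one method per op."""

    def _default(self, name: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> str:
        return f"{name}({', '.join(map(str, args))})"
```

Both `GetattrHandler().add("a", "b")` and `StringHandler().add("a", "b")` produce the same string, but in the second case `add` is an ordinary class attribute, so the lookup does not depend on a failed attribute search followed by `__getattr__`.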
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
evaluated = self.shape_env._maybe_evaluate_static(
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
return fn_cache(self, *args, **kwargs)
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```
The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR wraps it in a try/except, following the approach in `torch/_inductor/sizevars.py`, by adding a common utility function.
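A minimal sketch of that kind of guard is shown below. `FakeShapeEnv`, `maybe_evaluate_static` and `safe_statically_true` are hypothetical stand-ins written for this example, not the PyTorch classes or the utility added by this PR.

```python
import logging
from typing import Dict, Optional, Tuple

log = logging.getLogger(__name__)


class FakeShapeEnv:
    """Hypothetical stand-in: knows inclusive (lower, upper) bounds per symbol."""

    def __init__(self, var_ranges: Dict[str, Tuple[int, int]]) -> None:
        self.var_ranges = var_ranges

    def maybe_evaluate_static(self, symbol: str, limit: int) -> Optional[bool]:
        """Try to decide `symbol < limit`; None means 'cannot tell'."""
        lower, upper = self.var_ranges[symbol]  # KeyError for unknown symbols
        if upper < limit:
            return True
        if lower >= limit:
            return False
        return None


def safe_statically_true(shape_env: FakeShapeEnv, symbol: str, limit: int) -> bool:
    """Return True only when the fact is proven; any failure counts as unknown."""
    try:
        return shape_env.maybe_evaluate_static(symbol, limit) is True
    except Exception:
        log.debug("could not statically evaluate %s < %s", symbol, limit)
        return False
```

With this shape, an unknown symbol such as `m_start` yields `False` ("not proven") instead of aborting compilation with a `KeyError`.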
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
The current code assumes that indirect variables will be created by the
same `IndexPropagation` instance, however that isn't true in the case of
masked sub-blocks where we take in variables from the parent block.
This fixes the issue by moving the var range information up to the
`LoopBody` object where it can be shared by all sub-blocks.
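The ownership change can be sketched as follows. `SketchLoopBody`, `SketchBlock` and `indirect_var_ranges` are hypothetical names for this illustration and are not claimed to match the actual Inductor attributes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SketchLoopBody:
    """Owns the range table so that every sub-block sees the same entries."""

    indirect_var_ranges: Dict[str, int] = field(default_factory=dict)

    def add_indirect(self, size: int) -> str:
        name = f"indirect{len(self.indirect_var_ranges)}"
        self.indirect_var_ranges[name] = size
        return name


@dataclass
class SketchBlock:
    """A block or masked sub-block that reads ranges from the shared body."""

    body: SketchLoopBody

    def range_of(self, var: str) -> int:
        return self.body.indirect_var_ranges[var]


body = SketchLoopBody()
root_block = SketchBlock(body)
masked_block = SketchBlock(body)

# The variable is registered once on the body and is visible to the
# sub-block, which takes it in from the parent.
var = body.add_indirect(16)
```

Here `masked_block.range_of(var)` succeeds because the lookup goes through the shared body rather than through state private to one block.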
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984
Approved by: https://github.com/lezcano
Tries to fix #127677.
# Context
Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```
We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
    iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0),
    repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
    index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
    index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
    return (index_1,)
```
which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
    tmp1 = ks0
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 < 0
    tmp4 = tl.where(tmp3, tmp2, tmp0)
    # check_bounds()
    tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")

def call():
    arg0_1, arg1_1, arg2_1 = args
    buf1 = aten.repeat_interleave.Tensor(arg1_1)
    buf4 = empty_strided_cuda((u0, 64), (64, 1))
    triton_poi_fused_index_select_0.run(
        buf1, arg2_1, buf4, s0,
        triton_poi_fused_index_select_0_xnumel,
        grid=grid(triton_poi_fused_index_select_0_xnumel),
        stream=stream0)
```
# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)
When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.
# Approach
When creating `indirect` symbols from fallback, specify their range as `[-size, size - 1]` to avoid a lookup error with `indirectX`.
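To see why that range is enough for the bounds check, here is a small brute-force sketch in plain Python. `wrap_index` and `wrapped_bounds` are hypothetical helpers that mirror the `tl.where(idx < 0, idx + size, idx)` wraparound in the generated kernel; they are not part of Inductor.

```python
from typing import Tuple


def wrap_index(index: int, size: int) -> int:
    """Mirror the generated wraparound: negative indices count from the end."""
    return index + size if index < 0 else index


def wrapped_bounds(lower: int, upper: int, size: int) -> Tuple[int, int]:
    """Bounds of wrap_index over every index in the inclusive range [lower, upper]."""
    values = [wrap_index(i, size) for i in range(lower, upper + 1)]
    return min(values), max(values)


size = 8
# Assumed range for a fallback-created indirect symbol: [-size, size - 1].
lo, hi = wrapped_bounds(-size, size - 1, size)
in_bounds = 0 <= lo and hi < size
```

Every index in `[-size, size - 1]` wraps into `[0, size - 1]`, so once the symbol has a known range the lookup in `var_ranges` no longer raises and the bound `0 <= idx < size` is at least expressible.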
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While doing so,
we found that the analysis we were using for choosing
whether to generate asserts or not for dynamic shapes was completely
broken.
Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).
The new strategy is as follows:
- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
- This makes them always be codegen'd before the load and stores
- In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.
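The placement rule in the list above can be sketched with a toy kernel builder. `SketchKernel` and its two buffers are hypothetical and heavily simplified; the sketch only shows that a check issued into the load buffer is always emitted ahead of the stores.

```python
from typing import List


class SketchKernel:
    """Hypothetical kernel builder with separate load and store buffers."""

    def __init__(self) -> None:
        self.loads: List[str] = []
        self.stores: List[str] = []

    def check_bounds(self, index: str, size: str) -> None:
        # Issue the assert where a load using this index would go.
        self.loads.append(f"assert 0 <= {index} < {size}")

    def indirect_indexing(self, index: str, size: str) -> str:
        self.check_bounds(index, size)
        return index

    def load(self, buf: str, index: str) -> str:
        tmp = f"tmp{len(self.loads)}"
        self.loads.append(f"{tmp} = {buf}[{index}]")
        return tmp

    def store(self, buf: str, index: str, value: str) -> None:
        self.stores.append(f"{buf}[{index}] = {value}")

    def codegen(self) -> List[str]:
        # The load buffer (and therefore every check) comes before any store.
        return self.loads + self.stores


kernel = SketchKernel()
value = kernel.load("in0", "x0")
idx = kernel.indirect_indexing("i0", "ks0")
kernel.store("out0", idx, value)
lines = kernel.codegen()
```

In this toy version the assert for the store index lands right after the load, well before the store itself, which matches the "potentially much earlier" behaviour described for stores.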
We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.
For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.
Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.
Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.
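The Python side of this can be checked directly. `fold_neg` and `is_negative_zero` are hypothetical helper names for this sketch; it only demonstrates that plain float arithmetic keeps the sign of zero.

```python
import math


def fold_neg(value: float) -> float:
    """Constant-fold `neg` on a Python float; IEEE 754 keeps the sign of zero."""
    return -value


def is_negative_zero(value: float) -> bool:
    """-0.0 == 0.0 compares equal, so inspect the sign bit via copysign."""
    return value == 0.0 and math.copysign(1.0, value) < 0


folded = fold_neg(0.0)
```

Note that `folded == 0.0` is still true, so an equality test alone cannot tell the two zeros apart; `math.copysign` is what exposes the sign.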
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
We spend somewhere on the order of 1% in `sympy.Expr.free_symbols` as it is called millions of times.
Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is
horribly slow. It turns out though that there is another property `is_number` that does what we want.
> property is_number:
>
> Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster
> than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined
> function.
Even further, we also avoid the overhead of building the unnecessary set object.
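A short sketch of the two checks side by side, assuming `sympy` is installed. `is_constant_expr` is a hypothetical helper name used only for this example.

```python
import sympy

x = sympy.Symbol("x")
constant = sympy.Integer(3) * 2 + 1  # folds to the number 7
symbolic = 3 * x + 1                 # still depends on x


def is_constant_expr(expr: sympy.Expr) -> bool:
    """Cheap constant check: no free symbols and no undefined functions."""
    return bool(expr.is_number)


# The slower formulation builds a set of symbols just to test for emptiness.
slow_constant = not constant.free_symbols
slow_symbolic = not symbolic.free_symbols
```

Both formulations agree on these inputs; the difference is only in how much work is done to reach the answer.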
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688
Approved by: https://github.com/lezcano
I found that the upsample bicubic lowering was generating this line
```python
ops.index_expr(0.244094488188976*x0, torch.float32)
```
which is not good because triton's `ops.index_expr` expects integer expressions and dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105021
Approved by: https://github.com/lezcano
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.
Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
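For reference, the underlying math can be sketched in plain Python. The function names are borrowed from the reduction types above, but this is only a scalar illustration of Welford's algorithm and its pairwise combine, not the Inductor implementation; the `(mean, m2, weight)` tuple layout is an assumption of the sketch.

```python
from typing import Iterable, Tuple

Welford = Tuple[float, float, float]  # (mean, m2, weight)


def welford_reduce(values: Iterable[float]) -> Welford:
    """Single pass over raw data, tracking mean, sum of squared deviations, count."""
    mean, m2, weight = 0.0, 0.0, 0.0
    for x in values:
        weight += 1.0
        delta = x - mean
        mean += delta / weight
        m2 += delta * (x - mean)
    return mean, m2, weight


def welford_combine(a: Welford, b: Welford) -> Welford:
    """Merge two partial accumulator states without revisiting the data."""
    mean_a, m2_a, w_a = a
    mean_b, m2_b, w_b = b
    weight = w_a + w_b
    if weight == 0.0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * w_b / weight
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / weight
    return mean, m2, weight


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# First layer: reduce two chunks independently; second layer: combine them.
left = welford_reduce(data[:3])
right = welford_reduce(data[3:])
mean, m2, weight = welford_combine(left, right)
variance = m2 / weight  # population variance
```

Because the combine step needs all three values from each side, a reduction that returned only the variance could not be split across layers this way.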
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing
into direct indexing expressions. I also add support to the cpp printer for
Min/Max and fix the triton printer to support multi-argument Min/Max.
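One way a printer can handle multi-argument Min/Max is to nest calls to a binary function. The helper below is a hypothetical sketch of that idea (`print_nested` is not the name used in Inductor, and the actual printer may nest differently); `tl.maximum` and `tl.minimum` are the binary Triton functions it targets.

```python
from typing import Sequence


def print_nested(fn: str, args: Sequence[str]) -> str:
    """Lower an n-ary Min/Max to nested calls of a binary function."""
    if not args:
        raise ValueError("need at least one argument")
    if len(args) == 1:
        return args[0]
    mid = len(args) // 2
    left = print_nested(fn, args[:mid])
    right = print_nested(fn, args[mid:])
    return f"{fn}({left}, {right})"


three_way_max = print_nested("tl.maximum", ["a", "b", "c"])
two_way_min = print_nested("tl.minimum", ["x", "y"])
```

Splitting in the middle keeps the nesting depth logarithmic in the number of arguments, though for the short argument lists typical of indexing expressions a simple left fold would work just as well.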
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
This pass does a limited form of constant propagation, as well as propagation of
sympy indexing expressions. For example, say you have the function:
```python
def flip(x):
    i = torch.arange(x.size(0) - 1, -1, -1, device=x.device)
    return x[i]
```
On current main this results in indirect indexing:
```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = 3 - z0
    index1 = 3*indirect0 + z1
    index2 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        index_expr = ops.index_expr(get_index, torch.int64)
        set_indirect0 = self.set_indirect0(index_expr)
        get_index_1 = self.get_index('index1')
        load = ops.load('arg0_1', get_index_1)
        get_index_2 = self.get_index('index2')
        store = ops.store('buf0', get_index_2, load, None)
        return store
```
With this PR the indexing is propagated through the computation and into direct
indexing:
```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = -3*z0 + z1 + 9
    index1 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index1')
        store = ops.store('buf0', get_index_1, load, None)
        return store
```
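The substitution can be checked by hand: with `indirect0 = 3 - z0`, the old load index `3*indirect0 + z1` becomes `3*(3 - z0) + z1 = -3*z0 + z1 + 9`. The plain-Python sketch below verifies this over the full iteration space; the function names are hypothetical and exist only for this check.

```python
# Ranges from the loop bodies above: z0 in [0, 4), z1 in [0, 3).
Z0_RANGE, Z1_RANGE = 4, 3


def indirect_index(z0: int, z1: int) -> int:
    indirect0 = 3 - z0         # old index0, materialized via ops.index_expr
    return 3 * indirect0 + z1  # old index1, used for the load


def direct_index(z0: int, z1: int) -> int:
    return -3 * z0 + z1 + 9    # new index0 after propagation


# Collect every point where the two formulations disagree (expected: none).
mismatches = [
    (z0, z1)
    for z0 in range(Z0_RANGE)
    for z1 in range(Z1_RANGE)
    if indirect_index(z0, z1) != direct_index(z0, z1)
]
```

An empty `mismatches` list confirms that both loop bodies load from the same offsets of `arg0_1`.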
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101077
Approved by: https://github.com/lezcano, https://github.com/ngimel