Commit Graph

54 Commits

Jiong Gong
86869475ff [inductor] move dtype propagation log to schedule artifact (#101351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101351
Approved by: https://github.com/jansel
2023-05-16 06:43:38 +00:00
chunyuan
1faef895ca Inductor cpp wrapper: support sympy.Expr as input (#101257)
Leverage the logic in https://github.com/pytorch/pytorch/pull/95533 to get the `dtype` of a `sympy.Expr` and support it as a graph input in the cpp wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101257
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/jansel
2023-05-15 23:57:28 +00:00
Natalia Gimelshein
a4830bd86b fix sign return type (#101346)
Fixes #101216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101346
Approved by: https://github.com/eellison, https://github.com/jansel
2023-05-15 17:50:36 +00:00
Edward Z. Yang
beb7f79517 Fix intermediate hooks on inplace buffers, enable it in testing (#100322)
Fixes https://github.com/pytorch/pytorch/issues/100312

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100322
Approved by: https://github.com/ngimel
2023-04-30 13:34:44 +00:00
Jiong Gong
4456e932f8 [inductor] fix _print_Pow given reciprocal of dynamic dim with float exponent (#100090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100090
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-04-27 04:10:15 +00:00
leslie-fang-intel
deaf983bdb [Inductor][quant]Enable decomposed.quant/dequant lowering and vec code gen (#99131)
**Summary**
Since the current quantization flow has not decomposed quant/dequant into prim ops, in this PR:

- We enable the quant/dequant decomposition as a lowering inside inductor.
- For the `decomposed.quant/dequant.tensor` overloads, the `zero_point` and `scale` are loaded as scalar tensors, so we need to enable vectorized code generation for these op overloads (see the sketch after this list).
- Minor change: add `is_load_uint8_as_float` and `is_store_float_as_uint8` with default value `False` to `OptimizationContext`.
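
A minimal sketch of the decomposed pattern being lowered (the ops live under `torch.ops.quantized_decomposed`; exact registration and signatures may differ across versions):

```python
import torch

x = torch.randn(16)
scale, zero_point = 0.05, 0

# Per-tensor quantize/dequantize as decomposed ops; this PR lowers these
# inside inductor instead of leaving them as opaque calls.
# (Op registration may require importing torch.ao.quantization first.)
q = torch.ops.quantized_decomposed.quantize_per_tensor(
    x, scale, zero_point, 0, 255, torch.uint8)
dq = torch.ops.quantized_decomposed.dequantize_per_tensor(
    q, scale, zero_point, 0, 255, torch.uint8)
```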

**TestPlan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_quant_lowering
```
Co-authored with @Xia-Weiwen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99131
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-21 04:33:02 +00:00
XiaobingSuper
3af467eff4 inductor: support sqrt for dynamic shape (#99514)
When running the TIMM `convit_base` dynamic-shape case, an AssertionError always occurs; see https://github.com/pytorch/pytorch/issues/97877.

A simple repro is:
```
import torch
import torch._dynamo
import torch._dynamo.config as config

config.dynamic_shapes=True
torch._dynamo.config.assume_static_by_default=False
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        B, N, C = x.shape
        return self.get_rel_indices(N)

    def get_rel_indices(self, num_patches: int) -> torch.Tensor:
        img_size = int(num_patches ** .5)
        #rel_indices = torch.zeros(1, num_patches, num_patches, 3)
        ind = torch.arange(img_size)
        return ind

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)

x = torch.randn(8, 8, 8)
ref = model(x)
with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

```

With this fix, the generated code looks like this:
```

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/x5/cx5442c6dcuxsrrlnqi476yzjlgc6g53ukppuaettiyp6dszhmr4.h"
extern "C" void kernel(long* out_ptr0,
                       const long ks0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(std::floor(std::sqrt(ks0))); i0+=static_cast<long>(1L))
        {
            auto tmp0 = static_cast<long>(i0);
            out_ptr0[static_cast<long>(i0)] = tmp0;
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99514
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-04-20 04:22:49 +00:00
lezcano
495e1b4d0e Add device_asserts before indirect loads and stores (#98590)
This PR also adds a way to CSE statements (not only assignments).

The tests follow the pattern from https://github.com/openai/triton/pull/1143
They take a fair amount of time to run (90s on my box). If we wanted to
improve this, we could avoid testing the `ndim == 3` case.

Changes like this one make me hope that we get to clean up the number of
lowerings we have at some point...

Generated code for `x[y]` with `x.shape == (3, 2, 4),  y.ndim == 1`:

With `dynamic=False`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < 3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < 3")
tmp1 = tl.load(in_ptr1 + (x0 + (8*tmp0)), xmask)
```
With `dynamic=True`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tmp1 = tl.load(in_ptr1 + (x0 + (ks1*ks2*tmp0)), xmask)
```

Generated code for `x[y+1, y+1]` with `x.shape == (3, 2, 4), y.shape == (3, 3)`:
With `dynamic=False` (note how it folds the two upper bounds to `min(3, 2) == 2`):
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = 1
tmp2 = tmp0 + tmp1
tl.device_assert(((0 <= tmp2) & (tmp2 < 2)) | (~xmask), f"index out of bounds: 0 <= tmp2 < 2")
tmp3 = tl.load(in_ptr1 + (x0 + (12*tmp2)), xmask)
```

With `dynamic=True`:
```python
tl.device_assert(((0 <= tmp2) & (tmp2 < min(ks2, ks1))) | (~xmask), f"index out of bounds: 0 <= tmp2 < min(ks2, ks1)")
```

The same works when the CSE'd variable appears 3 or more times, but then it generates `min(ks0, min(ks1, ks2))`

Generated code for `x[y] = z` with `x.ndim = 3`, `y.ndim = 1` and dynamic shapes
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = tl.load(in_ptr1 + (x2), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tl.store(out_ptr0 + (x0 + (ks1*ks2*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
```

Fixes https://github.com/pytorch/pytorch/issues/93538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98590
Approved by: https://github.com/ngimel
2023-04-19 21:26:57 +00:00
haozhe.zhu
59e343b12c enable data type propagation (#98065)
Enable data type propagation at the scheduler node level.
Propagation policy:
(1) ops with dtype args [constant, load, rand, randn] -> use that dtype directly as the node dtype
(2) ops whose semantics determine the output dtype -> use that output dtype; see all the `override_return_dtype` uses in https://github.com/pytorch/pytorch/blob/master/torch/_inductor/lowering.py
(3) other ops: promote the input nodes' dtypes, e.g. ADD(BF16, FP32) -> FP32 output (see the sketch below).
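
A minimal sketch of policy (3), assuming standard PyTorch type promotion (the helper name is illustrative, not the inductor internals):

```python
import torch
from functools import reduce

def promote_node_dtype(input_dtypes):
    # Pairwise-promote all input dtypes, e.g. ADD(BF16, FP32) -> FP32.
    return reduce(torch.promote_types, input_dtypes)

assert promote_node_dtype([torch.bfloat16, torch.float32]) == torch.float32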

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98065
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5
2023-04-18 00:07:35 +00:00
Peter Bell
013c7f5ba4 [inductor] Move tl.broadcast call out codegen.common (#98304)
This makes only a cosmetic change to the generated code, but means
triton's broadcasting logic doesn't leak out into the CSE class.

Before:
```python
    tmp5_load = tl.load(in_ptr1 + (0))
    tmp5 = tl.broadcast_to(tmp5_load, [XBLOCK])
```

After:
```python
    tmp5 = tl.load(in_ptr1 + (0))
    tmp6 = tl.broadcast_to(tmp5, [XBLOCK])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98304
Approved by: https://github.com/ngimel
2023-04-05 23:10:46 +00:00
Jason Ansel
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
Natalia Gimelshein
e7d9331688 [inductor] hoist symbolic padding expressions (#97099)
Towards fixing pnasnet5large, see #96709. The generated kernel now looks much better:
```
@pointwise(size_hints=[1048576], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32', 4: 'i32', 5: 'i32', 6: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 6), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, ks0, ks1, ks2, ks3, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x1 = (xindex // ks0) % ks0
    x0 = xindex % ks0
    x2 = (xindex // ks3)
    x4 = xindex
    tmp0 = x1 + ((-1)*ks1)
    tmp1 = 0
    tmp2 = tmp0 >= tmp1
    tmp3 = ks2
    tmp4 = tmp0 < tmp3
    tmp5 = x0 + ((-1)*ks1)
    tmp6 = tmp5 >= tmp1
    tmp7 = tmp5 < tmp3
    tmp8 = tmp2 & tmp4
    tmp9 = tmp8 & tmp6
    tmp10 = tmp9 & tmp7
    tmp11 = tl.load(in_ptr0 + (x0 + ((-1)*ks1) + (ks2*x1) + (x2*(ks2*ks2)) + ((-1)*ks1*ks2) + tl.zeros([XBLOCK], tl.int32)), tmp10 & xmask, other=0)
    tmp12 = tl.where(tmp10, tmp11, 0.0)
    tl.store(out_ptr0 + (x4 + tl.zeros([XBLOCK], tl.int32)), tmp12, xmask)
 ```
Interestingly, removing `expand` in the index `simplify` function makes the `load` expression a little better, but then `store` fails to simplify to a flat store, so I'm leaving `expand` in.
 Full pnasnet still chokes on `ceiling` in batch_norm kernels; additionally, it looks like shape propagation goofs in inductor and generates overly complicated expressions, so we should switch to the metadata from the fx graph.
 I'm still not adding a `ceil` print to triton, because we should be able to hoist all indexing expressions (and just printing ceil without converting to int64 doesn't work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97099
Approved by: https://github.com/jansel
2023-03-21 21:43:32 +00:00
BowenBao
60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
XiaobingSuper
6eca391e83 inductor(cpu): remove __restrict__ keyword to avoid generating wrong result when two pointer point same memory (#96492)
Fixes https://github.com/pytorch/pytorch/issues/93365, https://github.com/pytorch/pytorch/issues/93357 and https://github.com/pytorch/pytorch/issues/96432. For now, we remove the `__restrict__` keyword to avoid generating wrong results. There is a draft PR https://github.com/pytorch/pytorch/pull/96404 that performs memory-alias checks before adding the `__restrict__` keyword, but the logic of those alias checks needs to be redesigned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96492
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-03-13 07:12:04 +00:00
Horace He
5bbec680d7 Fix usages of contextmanager without finally (#96170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
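
A minimal sketch of the intended behavior (assumes dynamic shapes are enabled; exact recompile behavior may vary by version):

```python
import torch

@torch.compile(dynamic=True)
def f(x, n):
    return x * n

x = torch.randn(4)
f(x, 2)  # n is traced as a SymInt rather than burned in as a constant,
f(x, 3)  # so this call can reuse the compiled graph (0 and 1 still specialize)
```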

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Natalia Gimelshein
88a31f4be6 hoist precomputed exprs from indices (#95690)
This generates compilable code for maskrcnn graph 13, with ceilings hoisted to be computed on the host. But it now fails with
```
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/symbolic_convert.py", line 379, in wrapper
    self.output.compile_subgraph(self, reason=reason)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/output_graph.py", line 562, in compile_subgraph
    pass1.foreach(stack_values)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 166, in foreach
    self(i)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 148, in __call__
    output.extend(value.reconstruct(self))
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/variables/dicts.py", line 40, in reconstruct
    codegen.create_load_python_module(collections),
TypeError: create_load_python_module() missing 1 required positional argument: 'push_null'

from user code:
   File "/scratch/ngimel/work/env/lib/python3.9/site-packages/torchvision-0.15.0a0+928b05c-py3.9-linux-x86_64.egg/torchvision/models/detection/backbone_utils.py", line 58, in forward
    x = self.fpn(x)
```
It looks like we never execute this `create_load_python_module()` path for other subgraphs.
Any advice on how to fix this, @voznesenskym @jansel?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95690
Approved by: https://github.com/jansel
2023-02-28 23:32:36 +00:00
Edward Z. Yang
58648822b6 Handle int/float arguments for cpp codegen in inductor (#95533)
This is a little questionable because we don't actually know what the dtype of the sympy expression is, and it's not clear we can rely on the assumptions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95533
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-28 03:57:35 +00:00
Xuehai Pan
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls (illustrated below). Only non-semantic changes are applied.

- #94587
- #94588
- #94592

Also, methods that contain only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```
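
For reference, the core rewrite itself looks like this (illustrative example):

```diff
-        super(MyModule, self).__init__()
+        super().__init__()
```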

Some cases where the rewrite would change semantics are kept unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
Fabio Rocha
e116ca93e1 Run test_torchinductor*.py with implicit_fallbacks=False (#94039)
This way it errors out for ops that don't have decomps and
requires you to add explicit fallbacks to lowering.py

Turns out there are a lot, and this commit adds them as well.
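
For local runs, the knob this PR exercises can be set directly (a sketch):

```python
from torch._inductor import config

# Error out on ops without decomps/lowerings instead of silently
# falling back to ATen.
config.implicit_fallbacks = False
```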

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94039
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/ngimel
2023-02-10 18:10:56 +00:00
Natalia Gimelshein
a5daea69fb teach inductor to handle floor (#94341)
Per title; this happens when there is upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-10 11:21:57 +00:00
PyTorch MergeBot
6007874bbb Revert "teach inductor to handle floor (#94341)"
This reverts commit e7df9aaec8.

Reverted https://github.com/pytorch/pytorch/pull/94341 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the CudaTest failure looks related.  It fails on both PR and trunk e7df9aaec8
2023-02-09 19:31:08 +00:00
Natalia Gimelshein
e7df9aaec8 teach inductor to handle floor (#94341)
Per title; this happens when there is upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-09 17:09:35 +00:00
Natalia Gimelshein
ca8450849b compute dynamic tensor shapes for indexing on the host (#93872)
Hoists computation of some shapes used in triton kernel indexing to the host, so the resulting triton code is
```
x1 = (xindex // pks0) % 64
```
instead of
```
x1 = (xindex // (1 + (((((-1) + ks0) // 4))*((((-1) + ks0) // 4))) + (2*((((-1) + ks0) // 4))))) % 64
```
with `pks0` arg computed on the host
```
ps0 = (1 + ((((-1) + s2) // 4)))*(1 + ((((-1) + s2) // 4)))
```
It doesn't work yet for indexing expressions that are directly in the `load` statement, e.g.
```
tmp0 = tl.load(in_ptr0 + (r1 + x0 + (x0*(((((-1) + ks0) // 32))*((((-1) + ks0) // 32)))) + (2*x0*((((-1) + ks0) // 32)))), rmask & xmask, eviction_policy='evict_last').to(tl.float32)
```
Unfortunately, `unet`, which is one of the examples failing with `floor`, does the latter:
```
tmp1 = ((-1)*(1/(((-1) + (floor(2.0*(ks0//16))))))) + ((1/(((-1) + (floor(2.0*(ks0//16))))))*(ks0 // 16))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93872
Approved by: https://github.com/jansel
2023-02-03 09:58:39 +00:00
Horace He
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
Natalia Gimelshein
4110900b22 let inductor generate broadcast when loading a single value (#92595)
For better perf with MLIR triton.
Changes
```
tmp32 = tl.load(seed3 + (0 + tl.zeros([XBLOCK, RBLOCK], tl.int32)), None)
```
to
```
tmp32_load = tl.load(seed3+(0)); tmp32 = tl.broadcast_to(tmp32_load, [XBLOCK, RBLOCK])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92595
Approved by: https://github.com/Chillee
2023-01-20 00:05:01 +00:00
Horace He
4f4b62e4a2 some fixes to get symbolic shapes working through inductor (#92320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92320
Approved by: https://github.com/ezyang
2023-01-19 03:09:02 +00:00
Jason Ansel
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
Fabio Rocha
bc843682dd [inductor] New approach for computing triton load/store masks (#91241)
This PR is a new version of #89566, fixing a test failure.

Couldn't get ghstack to cooperate on updating that PR after re-opening,
so I started a new one.

This changes the way masks for loads/stores are computed in the triton backend of inductor.

The new approach is to iterate over all variables used in the indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created; a sketch of the idea follows below.
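
A minimal sketch of the mask-collection idea (simplified; not the exact inductor code):

```python
import sympy

def mask_vars_for(index: sympy.Expr, cse_mask_vars: dict) -> set:
    masks = set()
    for sym in index.free_symbols:
        name = sym.name
        if name[0] in "xyr" and name[1:].isdigit():
            masks.add(f"{name[0]}mask")   # x0 -> xmask, r3 -> rmask, ...
        elif name.startswith("tmp"):
            # Indirect indexing: reuse the mask set recorded on the
            # TritonCSEVariable when it was created.
            masks |= cse_mask_vars[name]
    return masks
```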

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.

Relative to #89566, the only change is to not include the mask variables
of the arguments when the function being called is `tl.where`, because
`tl.where` is often used precisely to make sure the output
variable has valid values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel
2022-12-22 11:54:48 +00:00
Mark Saroufim
15af4b1cee Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars each gated behind their own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos, https://github.com/malfet
2022-12-21 11:56:58 +00:00
PyTorch MergeBot
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
Jason Ansel
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land.  This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00
Peter Bell
81f351acd7 [inductor] Prevent blowup in inner_fn_str and extract_read_writes (#88933)
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in terms. Say for
example you have:

```python
def fn(a):
    for _ in range(3):
        a = ops.mul(a, a)
    return a
```

You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```

This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward`, which in pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.

This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```

This corresponds to the sequence of `ops` calls made.
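
A self-contained sketch of the blowup and the fix (plain Python, independent of inductor):

```python
# Naive string formatting: each multiply doubles the expression text.
expr = "load(arg_0, i0)"
for _ in range(3):
    expr = f"{expr} * {expr}"          # ends with 8 loads, 7 multiplies

# Assignment-based formatting: one line per op, linear growth.
lines, cur = ["tmp0 = load(arg_0, i0)"], "tmp0"
for i in range(1, 4):
    lines.append(f"tmp{i} = {cur} * {cur}")
    cur = f"tmp{i}"
lines.append(f"return {cur}")
print("\n".join(lines))                # matches the 4-assignment form above
```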

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
2022-12-15 15:36:52 +00:00
chunyuan
d35aa2f65a Inductor cpp wrapper: support Reduction (#88561)
For reductions, the code string in the codegen stage and the execution stage differ because of `\`.

- The code string obtained from `code.getvalue()` (`code` is an `IndentedBuffer`) in the codegen stage:
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :\
                  omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                  omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

- The code string loaded during execution (the `\` is escaped):
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

Thus we can't get the same hash value for these two pieces of code.
This PR adds a function that escapes the backslash in the codegen stage, so both stages see (and hash) the same string.
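
A minimal sketch of the idea, assuming the goal is a stable hash across both stages (helper names are hypothetical):

```python
import hashlib

def normalize_continuations(code: str) -> str:
    # Collapse backslash line continuations so the codegen-stage string
    # matches what is seen at execution time.
    return code.replace("\\\n", " ")

def code_hash(code: str) -> str:
    return hashlib.sha256(normalize_continuations(code).encode()).hexdigest()
```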

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-12-14 12:29:50 +00:00
PyTorch MergeBot
b2795d3c4e Revert "[inductor] New approach for computing triton load/store masks (#89566)"
This reverts commit c6c2de586d.

Reverted https://github.com/pytorch/pytorch/pull/89566 on behalf of https://github.com/clee2000 due to broke test_invalid_operand_issue1_cuda in inductor/test_torchinductor on https://github.com/pytorch/pytorch/actions/runs/3657444733/jobs/6181700572
2022-12-09 19:36:25 +00:00
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Fabio Rocha
c6c2de586d [inductor] New approach for computing triton load/store masks (#89566)
This PR changes the way masks for loads/stores are computed in the triton backend of inductor.

The new approach is to iterate over all variables used in the indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89566
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-12-09 12:43:19 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars each gated behind their own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
Nikita Karetnikov
226e803ecb [Inductor] handle non-positive exponents in Pow (#90146)
Fixes #90125.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90146
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-05 09:16:35 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit's internal land and merged PR were inconsistent. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to the landed internal commit not matching this one, causing merge conflicts and preventing importing and landing new commits
2022-12-02 09:57:31 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm 015b05af18/torch/hub.py (L26), which doesn't support the 'desc' kwarg and is not iterable.

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Wu, Chunyuan
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars each gated behind their own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Peter Bell
1267dcf297 [inductor] Fix nan handling for aten.sign (#88937)
ATen gives `sign(nan) == 0` but inductor's cuda codegen would give
`sign(nan) == 1`.
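
A quick illustration of the two semantics (a sketch; the nested-where chain shows how such a bug can arise, not the exact codegen):

```python
import torch

x = torch.tensor([float("nan"), -2.0, 0.0, 3.0])

# ATen semantics: sign(nan) == 0. The comparison-difference form matches,
# because every comparison with NaN is False:
good = (x > 0).to(x.dtype) - (x < 0).to(x.dtype)   # [0., -1., 0., 1.]

# A nested-where form sends NaN to 1, since `x < 0` and `x == 0` are
# both False for NaN:
neg, zero, one = torch.full_like(x, -1.0), torch.zeros_like(x), torch.ones_like(x)
bad = torch.where(x < 0, neg, torch.where(x == 0, zero, one))  # [1., -1., 0., 1.]
```
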
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88937
Approved by: https://github.com/ngimel
2022-11-21 20:56:40 +00:00
Wang, Eikan
bc716383a6 Redefine the simdlen semantic (#89263)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant disabling vectorization, while a specific value meant the number of elements to be vectorized at a time. But that count depends on the data type: for a 256-bit SVE/SIMD ISA on ARM or X86, `simdlen` would need to be 8 for Float but 16 for BFloat16. Hence, this PR redefines `simdlen` as the vector bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the hardware and pick the proper vectorization ISA. On X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It behaves as disabled if the bit width does not match a supported ISA width (see the sketch below).
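
In practice (a sketch of the config knob; values are illustrative):

```python
from torch._inductor import config

config.cpp.simdlen = None  # auto-detect ISA (AVX512 preferred over AVX2 on x86)
config.cpp.simdlen = 256   # force 256-bit vectors, e.g. AVX2
config.cpp.simdlen = 1     # explicitly disable SIMD
```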

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:08:16 +00:00
PyTorch MergeBot
4e1d19c5a5 Revert "Redefine the simdlen semantic: (#88482)"
This reverts commit fce6d6b3dc.

Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554
2022-11-17 04:58:53 +00:00
Wang, Eikan
fce6d6b3dc Redefine the simdlen semantic: (#88482)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant disabling vectorization, while a specific value meant the number of elements to be vectorized at a time. But that count depends on the data type: for a 256-bit SVE/SIMD ISA on ARM or X86, `simdlen` would need to be 8 for Float but 16 for BFloat16. Hence, this PR redefines `simdlen` as the vector bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width: detect the hardware and pick the proper vectorization ISA. On X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It behaves as disabled if the bit width does not match a supported ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-17 03:27:54 +00:00
Fabio Rocha
9262d18e1b [inductor] Introduce CSEVariable type and use it to track if Triton variables are scalar (#88347)
This fixes https://github.com/pytorch/torchdynamo/issues/1515

To fix it, we need to keep track of whether a Triton variable is a scalar (so we can skip the mask when doing indirect loads through it). This requires a way of annotating the variable names generated by CSE with properties.

So now CSE uses a CSEVariable class to keep track of variables and lets backends subclass it so they can annotate variables with whatever information they want. TritonCSEVariable is such a subclass, tracking the `is_scalar` property.
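
A minimal sketch of the shape of this API (simplified names, not the exact inductor code):

```python
class CSEVariable:
    """A named value produced by common-subexpression elimination."""

    def __init__(self, name: str):
        self.name = name

    def __str__(self) -> str:
        return self.name

class TritonCSEVariable(CSEVariable):
    """Triton annotation: scalar variables need no mask on indirect loads."""

    def __init__(self, name: str, is_scalar: bool = False):
        super().__init__(name)
        self.is_scalar = is_scalar
```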

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347
Approved by: https://github.com/jgong5, https://github.com/ngimel
2022-11-15 20:52:37 +00:00
Jongsoo Park
0544a32ba3 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946)
Summary: ReinterpretView doesn't seem to be handled properly with matrix-multiply Triton kernels.

Reviewed By: bertmaher

Differential Revision: D40836677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946
Approved by: https://github.com/jansel
2022-11-15 00:48:49 +00:00