Commit Graph

419 Commits

Author SHA1 Message Date
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod constructor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
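
For illustration (hypothetical snippet, not from the PR diff), this is the kind of rewrite RUF025 suggests:

```python
keys = ["a", "b", "c"]

# Before: a dict comprehension assigning the same constant to every key
d1 = {k: 0 for k in keys}

# After: dict.fromkeys makes the shared static value explicit and is faster
d2 = dict.fromkeys(keys, 0)
assert d1 == d2

# The pitfall the message alludes to: a mutable default is the *same* object
# for every key, which may be a bug if unintentional.
d3 = dict.fromkeys(keys, [])
d3["a"].append(1)
print(d3)  # {'a': [1], 'b': [1], 'c': [1]}
```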

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Pearu Peterson
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

# Read the mypy error report
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the file path, line number and error code out of each error line
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Append "# type: ignore[...]" comments to the offending source lines
# (processed in reverse line order)
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
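
For reference, a hypothetical example (file path and message invented) of the kind of mypy output line the script parses:

```python
import re

line = 'torch/_dynamo/utils.py:123:5: error: Name "result" may be undefined  [possibly-undefined]'
match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", line)
print(match.groups())  # ('torch/_dynamo/utils.py', '123', 'possibly-undefined')
```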

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Edward Z. Yang
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.
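
A generic example (not from this PR) of the kind of code whose checking changes under this flag:

```python
from typing import Optional

class Cache:
    # With local_partial_types (implied by dmypy), a None-initialized attribute
    # must be annotated here; mypy will no longer infer its type from an
    # assignment made in a different scope such as __init__.
    hits = None                    # error: needs e.g. an Optional[int] annotation
    misses: Optional[int] = None   # fine: the partial type is resolved immediately

    def __init__(self) -> None:
        self.hits = 0
```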

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Edward Z. Yang
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I teed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
eellison
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactors `disable_cudagraphs: bool`, and `disable_cudagraphs_reason: str` -> `Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
Peter Bell
f129e3fe03 [inductor] Handle cum{sum,prod} on zero-dim tensors (#117990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117990
Approved by: https://github.com/lezcano
2024-01-26 22:21:42 +00:00
Edward Z. Yang
25f72194e8 Realize inputs to DynamicScalar before unwrapping storage (#118125)
Fixes https://github.com/pytorch/pytorch/issues/118102

Unfortunately, the test still fails due to an unrelated problem https://github.com/pytorch/pytorch/issues/117665

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118125
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #117862
2024-01-26 18:08:03 +00:00
leslie-fang-intel
b66c4eda61 [Inductor] Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/108220, which improved the performance of `basic_gnn_gin`, `basic_gnn_sage` and `basic_gnn_gcn` in the multi-thread test cases. However, it caused a performance regression for these 3 models in the single-thread test case, as reported in https://github.com/pytorch/pytorch/issues/117740. This PR fixes the single-thread issue by adding a thread-number check to decide whether or not to fall back for `scatter_reduce_` (see the sketch below).
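
A rough sketch of the idea (hypothetical helper, not the PR's actual code): fall back to the atomic-add `scatter_reduce_` path only when more than one thread is in use.

```python
import torch

def should_fallback_scatter_reduce() -> bool:
    # Use the ATen scatter_reduce_ (atomic-add) fallback only in the
    # multi-threaded case; single-threaded runs are faster without it.
    return torch.get_num_threads() > 1
```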

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_scatter_using_atomic_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118278
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-01-26 12:43:25 +00:00
Michael Lazos
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
vfdev-5
d0fc268918 Fixed issue in upsample_nearestnd lowering with scales (#117538)
Fixed #116848

Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264

Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used even though it can be provided by the user. The code worked in cases where the user-provided scale value can be recovered from the `input / output` sizes, e.g. scale=2.0. However, it would fail if the input scale is a float value such as 2.3: in that case the recomputed scale is slightly different (e.g. 2.292682926829268, depending on input and output size) and can lead to an inconsistent output.
This problem was "fixed" to the following in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
however, this leads to a wrong scale value, as the user-provided scale should be inverted (1 / scale) before being used; see the sketch below.
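
A sketch of the presumably intended behavior (not the exact PR diff): honor the user-provided scales, inverting them to match the input/output ratio convention used above.

```python
scales = [i / o for i, o in zip(i_sizes, o_sizes)]
for i, scale in enumerate(scales_x):
    if scale:
        scales[i] = 1.0 / scale  # the user scale is output/input, so invert it
```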

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
2024-01-17 18:14:35 +00:00
Peter Bell
7a8013fbfa [inductor] Handle more edge cases in slice and slice_scatter (#117377)
Fixes #117110

When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.

In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.

This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.
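
A minimal sketch (hypothetical helper, not the PR's code) of the normalization described above:

```python
def normalize_slice(start: int, end: int, dim_size: int):
    # Mirror Python slicing semantics: wrap negative indices, clamp to the
    # valid range, and collapse end < start into an empty slice.
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    if end < start:
        end = start  # e.g. start=2, end=0 becomes the empty slice [2:2]
    return start, end
```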

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
2024-01-15 17:05:48 +00:00
Adnan Akhundov
c3e2b94827 Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.

This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
2024-01-14 23:31:38 +00:00
rzou
cb42bc705b Make auto_functionalized HOP fallback in inductor (#117084)
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
  inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
  **kwargs). This is true for all the OpOverloads but not HOPs. I
  rewrote the code to not rely on this.

Test Plan:
- existing tests
- new test for auto_functionalized HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
2024-01-12 17:57:01 +00:00
Sun, Jiayi
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact of the pointwise_cat optimization on CPU, so we disabled it. We will revisit this later after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

and causes a performance regression of pytorch_unet again. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
PyTorch MergeBot
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa4.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
Edward Z. Yang
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
Jason Ansel
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seem to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
Valentine233
20c2ec9a15 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-07 04:58:23 +00:00
PyTorch MergeBot
2ccc7af028 Revert "[CPU] Add flash attention mask version (#115913)"
This reverts commit 76a3fbb709.

Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))
2024-01-05 02:39:12 +00:00
Valentine233
76a3fbb709 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-05 01:27:36 +00:00
leslie-fang-intel
81cebca3d2 [Inductor] [Quant] Fix QConv Binary Inplace Layout Issue (#115613)
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:

- Following the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.

- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be changed in place, we should avoid copying `accum` by changing its memory format or data type inside the kernel implementation. Instead, the necessary memory format or data type changes have been moved to the lowering of `QConvPointWiseBinaryPT2E`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
2023-12-24 08:04:29 +00:00
Peter Bell
4f4b931aba [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
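
A small illustration (not from the PR) of the overflow described above:

```python
import torch

x = torch.full((100_000,), 8.0)     # float32
sum_sq = (x * x).sum()              # 6.4e6: fine in float32 (the opmath type)
print(sum_sq.to(torch.float16))     # inf: exceeds float16's max (~65504)
print(sum_sq / x.numel())           # 64.0: the normalized value would have fit
```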

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-23 01:06:43 +00:00
Aaron Meurer
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?
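
For context, a minimal eager-mode sketch of the semantics being decomposed (hypothetical, not the PR's lowering): `take` indexes into the input as if it were flattened to 1-D.

```python
import torch

def take_sketch(x: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    # Index into the flattened input; the result has the same shape as `index`.
    return x.reshape(-1)[index]

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 5, 11])
assert torch.equal(take_sketch(x, idx), torch.take(x, idx))
```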

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
Yifu Wang
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
Adnan Akhundov
247f9c3de4 Preserve strides of custom Triton kernel args (#116219)
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).

In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretView` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics, which is not necessarily desirable in general. Open to suggestions, though.

Test Plan:

```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
2023-12-21 22:46:32 +00:00
vfdev-5
b72127cd4b [inductor] Support sym exprs in lowering constant promotion (#116196)
Follow-up to https://github.com/pytorch/pytorch/pull/115920

This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten

def func(x, a):
    return aten.div(x * 0.5, a, rounding_mode=None)

cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
  File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
    return div(a, b)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
    a, b = promote_constants(
  File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
  target: aten.div.Tensor_mode
  args[0]: 1.0*s0
  args[1]: s1
  kwargs: {'rounding_mode': None}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
2023-12-20 21:59:51 +00:00
PyTorch MergeBot
c215e59bf2 Revert "[inductor] Avoid bool being upcast to int (#109913)"
This reverts commit 92998693a9.

Reverted https://github.com/pytorch/pytorch/pull/109913 on behalf of https://github.com/jeanschmidt due to causing performance regression in relevant metrics, @malfet I believe you are the correct person to help identify and fix the issues. More details check internal OPS count for ads metricsnin the internal related diff ([comment](https://github.com/pytorch/pytorch/pull/109913#issuecomment-1864397407))
2023-12-20 12:33:50 +00:00
Elias Ellison
9a2a44457a SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069)
The logic to check alignment for realized tensors in the backward can be extended for realized tensors in the forward. This fixes an interaction with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116069
Approved by: https://github.com/drisspg
2023-12-20 00:14:20 +00:00
Peter Bell
92998693a9 [inductor] Avoid bool being upcast to int (#109913)
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```

This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.

In the current code we also see:
```python
     tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.

Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:

```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
2023-12-19 14:16:10 +00:00
Isuru Fernando
8b0122ad33 Add lowerings for reflection_pad{1, 3}d_backward (#115645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115645
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-12-19 04:05:10 +00:00
vfdev-5
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed issue with true div on integer input with dyn shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
PyTorch MergeBot
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
Peter Bell
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg so isn't factored into
the type promotion. However, in eager it would promote to float32.
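
An analogous eager example with a plain Python float scalar (illustrative only; under dynamic shapes the sympy.Expr plays a similar non-tensor scalar role):

```python
import torch

x = torch.arange(10)      # int64 tensor
print((x + 0.5).dtype)    # torch.float32: the float scalar promotes to the default dtype
```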

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
Peter Bell
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow as the intermediate result is unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
Yang Chen
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires that the source and target types have the same bitwidth. Because the input tensor's dtype could be promoted, e.g. from float16 to float, we have to cast the tensor back to its original source dtype before invoking bitcast in such cases. After that, we also need to convert the bit-casted tensor back to float to make sure we keep using higher-precision values for the rest of the computation.
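
An analogous illustration at the eager-tensor level (the PR concerns Inductor's bitcast op, but the constraint is the same):

```python
import torch

x = torch.tensor([1.0], dtype=torch.float16)
print(x.view(torch.int16))   # ok: a bitcast between two 16-bit types

# If the value had been promoted to float32 upstream, cast it back to float16
# first, bitcast, then return to float32 so later computation keeps the
# higher precision.
promoted = x.to(torch.float32)
bits = promoted.to(torch.float16).view(torch.int16)
print(bits.to(torch.float32))
```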

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
Peter Bell
02196c21ac [inductor] Parameterize ir.Scan on combine_fn (#109132)
This replaces `tl.cumsum` and `tl.cumprod` with calls to `tl.associative_scan`
where the combine function is generated from inductor IR.

So before we had:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.cumsum(tmp2, 1)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Now we have:
```python
@triton.jit
def _triton_helper_fn0(arg0, arg1):
    tmp0 = arg0 + arg1
    return tmp0

@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 20
    rnumel = 30
    RBLOCK: tl.constexpr = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (30*x0)), rmask & xmask, other=0).to(tl.float32)
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
    tmp2 = tl.where(rmask & xmask, tmp1, 0)
    tmp3 = tl.associative_scan(tmp2, 1, _triton_helper_fn0)
    tl.store(out_ptr0 + (r1 + (30*x0)), tmp3, rmask & xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109132
Approved by: https://github.com/lezcano
2023-12-12 16:30:50 +00:00
Isuru Fernando
505574c46a Add decomposition for torch.block_diag (#115096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115096
Approved by: https://github.com/peterbell10
2023-12-11 20:04:22 +00:00
Isuru Fernando
d40a7c6026 Add decompositions for replication_pad (#115113)
Fixes #115395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115113
Approved by: https://github.com/peterbell10
2023-12-09 02:44:07 +00:00
Isuru Fernando
fb19947962 Add decompositions for reflection_pad{1, 2, 3}d (#115100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115100
Approved by: https://github.com/peterbell10
2023-12-08 23:05:57 +00:00
Peter Bell
7aac689b19 [inductor] Add ir.Scan and lower aten.cumsum on CUDA (#106581)
This adds the `ir.Scan` node (currently only supported on CUDA) which re-uses the existing reduction kernel machinery to support different kinds of non-pointwise ops. Just like reductions it supports prologue and epilogue fusions and has both persistent and non-persistent kernel generation.

Currently this doesn't support the equivalent of `Reduction.create_multilayer` and will instead fall back to eager in those cases. This is because splitting into multiple kernel invocations ends up being far slower than cub's single kernel strategy which matches the performance of a copy kernel.

Fixes https://github.com/pytorch/pytorch/issues/93631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106581
Approved by: https://github.com/lezcano, https://github.com/atalman
2023-12-05 23:31:49 +00:00
lezcano
0a9819e3e1 Prefer is_number over is_constant() (#114513)
`is_constant` tries really hard to check whether an expression is constant; `is_number` is often enough. Note that `sympy.nan.is_number` is true, and the same holds for infinities.
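
A quick sympy illustration of the `is_number` behavior noted above:

```python
import sympy

print(sympy.nan.is_number)                  # True
print(sympy.oo.is_number)                   # True (same for infinities)
print(sympy.Symbol("s0").is_number)         # False
print((sympy.Symbol("s0") + 1).is_number)   # False
```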

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114513
Approved by: https://github.com/peterbell10
2023-12-05 16:56:15 +00:00
PyTorch MergeBot
0ee1e469cb Revert "Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)"
This reverts commit 3d47b92dfb.

Reverted https://github.com/pytorch/pytorch/pull/114520 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114520#issuecomment-1840890210))
2023-12-05 14:24:30 +00:00
chilli
3d47b92dfb Modify pointwise cat heuristic to only apply when inputs are all pointwise and outputs are all pointwise (#114520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114520
Approved by: https://github.com/eellison
2023-12-02 04:02:39 +00:00
Antoni Viros
d47f715d29 Expose Flash attn to autograd (#114378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
Kurt Mohler
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a6.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
chilli
597d3fb86a Add additional guard for index_put fallback for bfloat16 on whether it's accumulating or not (#114788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114788
Approved by: https://github.com/cpuhrsch
2023-11-30 00:33:50 +00:00