Commit Graph

359 Commits

Author SHA1 Message Date
soulitzer
8885128dcc Fix backward for SDPA NT jagged layout (#115576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115576
Approved by: https://github.com/jbschlosser, https://github.com/ani300
2023-12-12 18:35:40 +00:00
Jesse Cai
4cb7dd0fc9 [sparse][quant] Add support for vector alpha in cusparselt mm (#112056)
Summary:

This PR adds support for passing in an alpha Tensor, which represents
a tensor of alpha values to fuse into the matmul.

```
cusparselt_sparse_mm = alpha * (A @ B) + bias
```

This operation is necessary for quantization, where we would like to
fuse one of the dequant matmuls into the sparse op.
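
A dense reference for the fused semantics, as a sketch only: the real op runs a 2:4-sparse matmul through cuSPARSELt, and the broadcast axis assumed for `alpha` below is illustrative, not confirmed by this PR.

```
import torch

# Dense sketch of: out = alpha * (A @ B) + bias, with a vector of alphas.
A = torch.randn(64, 128)
B = torch.randn(128, 256)
alpha = torch.rand(64, 1)     # one alpha per output row (assumed axis)
bias = torch.randn(256)
out = alpha * (A @ B) + bias  # what the fused cuSPARSELt kernel computes
```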

Test Plan:

```
python test/test_sparse_semi_structured.py -k alpha
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112056
Approved by: https://github.com/cpuhrsch
2023-12-04 16:56:06 +00:00
Antoni Viros
d47f715d29 Expose Flash attn to autograd (#114378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114378
Approved by: https://github.com/drisspg
2023-12-01 23:42:06 +00:00
Jesse Cai
ae593d0393 [sparse][semi-structured][inductor] meta registrations for _cslt_sparse_mm + additional stride checking in test. (#114685)

Summary:

This PR adds meta registrations for _cslt_sparse_mm.

Based on the work @drisspg did
in #114370.

Additionally, it updates the tests by checking that the strides of the
sparse result and the result returned by sparse+compile are the same, to
avoid errors like those found in

https://github.com/pytorch/pytorch/pull/114477.

Test Plan:
```
python test/test_sparse_semi_structured.py -k compile_cusparselt
python test/test_sparse_semi_structured.py -k compile_cutlass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114685
Approved by: https://github.com/alexsamardzic, https://github.com/drisspg
2023-11-29 00:31:52 +00:00
Jon Chuang
cef79c0df4 [inductor] _sparse_semi_structured_linear fallback - no meta registration; not on testing path (#114477)
The test was wrong in the original PR, and the merged changes were never tested. Further, the sparse op was never actually compiled, due to a missing `fullgraph=True` and a missing meta registration.

When the meta registration is added as per this PR, it gives wrong answers when the input needs to be padded and when the input needs to be reshaped.

Could this have something to do with the generated inductor code for:
```
 constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
...
slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
```
and

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
...
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
```

Failing graphs:
Padded:
```
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  ===== Forward graph 5 =====
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.66 class GraphModule(torch.nn.Module):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[1, 128]"):
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         constant_pad_nd: "f16[32, 128]" = torch.ops.aten.constant_pad_nd.default(primals_3, [0, 0, 0, 31], 0.0)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[32, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(constant_pad_nd, primals_1, primals_2);  constant_pad_nd = primals_1 = primals_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 0, 0, 1);  _sparse_semi_structured_linear = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_2: "f16[1, 128]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[1, 128]" = torch.ops.aten.relu.default(slice_2);  slice_2 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         permute: "f16[128, 1]" = torch.ops.aten.permute.default(primals_3, [1, 0]);  primals_3 = None
[2023-11-23 13:59:51,102] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, le, permute]

```

Reshape:

```
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.69 class GraphModule(torch.nn.Module):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, primals_1: "f16[128, 64]", primals_2: "i16[128, 8]", primals_3: "f16[128]", primals_4: "Sym(s0)", primals_5: "Sym(s1)", primals_6: "f16[s0, s1, 128]"):
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:145, code: x = self.linear(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         mul: "Sym(s0*s1)" = primals_4 * primals_5
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view: "f16[s0*s1, 128]" = torch.ops.aten.view.default(primals_6, [mul, 128]);  primals_6 = mul = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         _sparse_semi_structured_linear: "f16[s0*s1, 128]" = torch.ops.aten._sparse_semi_structured_linear.default(view, primals_1, primals_2, bias = primals_3);  primals_1 = primals_2 = primals_3 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         slice_1: "f16[s0*s1, 128]" = torch.ops.aten.slice.Tensor(_sparse_semi_structured_linear, 1, 0, 9223372036854775807);  _sparse_semi_structured_linear = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         view_1: "f16[s0, s1, 128]" = torch.ops.aten.view.default(slice_1, [primals_4, primals_5, 128]);  slice_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/jonch/Desktop/Programming/mlsys/pytorch/test/test_sparse_semi_structured.py:147, code: return torch.nn.functional.relu(x)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         relu: "f16[s0, s1, 128]" = torch.ops.aten.relu.default(view_1);  view_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(relu)
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         alias_1: "f16[s0, s1, 128]" = torch.ops.aten.alias.default(alias);  alias = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         le: "b8[s0, s1, 128]" = torch.ops.aten.le.Scalar(alias_1, 0);  alias_1 = None
[2023-11-23 14:01:03,463] [0/2] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return [relu, view, le, primals_4, primals_5]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114477
Approved by: https://github.com/jcaip
2023-11-28 19:35:05 +00:00
drisspg
8556a09d44 Require less alignment for attn bias (#114173)
# Summary
Improved Fix for Attention Mask Alignment Issue (#112577)

This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and to cause unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem-eff attention.

## Changes
Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.

Should this be warn_once?

We only call expand once, on the aligned mask.

Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115
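
A minimal Python sketch of the pad-then-slice approach described above (illustrative only; the function name and alignment constant are assumptions, not the actual ATen code):

```
import torch
import torch.nn.functional as F

def align_attn_bias(bias: torch.Tensor, alignment: int = 8) -> torch.Tensor:
    last = bias.size(-1)
    pad = (alignment - last % alignment) % alignment
    if pad == 0 and bias.stride(-1) == 1:
        return bias  # already aligned; no copy needed
    # Padding materializes an aligned buffer; slicing restores the logical
    # width while the storage stays aligned.
    return F.pad(bias, (0, pad if pad else alignment))[..., :last]
```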

@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
2023-11-28 02:40:41 +00:00
PyTorch MergeBot
88a8a0daa4 Revert "Require less alignment for masking (#114173)"
This reverts commit f882c175d8.

Reverted https://github.com/pytorch/pytorch/pull/114173 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor tests f882c175d8 ([comment](https://github.com/pytorch/pytorch/pull/114173#issuecomment-1823552362))
2023-11-22 21:49:31 +00:00
drisspg
f882c175d8 Require less alignment for masking (#114173)
# Summary
Improved Fix for Attention Mask Alignment Issue (#112577)

This PR addresses Issue #112577 by refining the previously implemented fix, which was found to be incorrect and to cause unneeded memory regressions. The update simplifies the approach to handling the alignment of the attention mask for mem-eff attention.

## Changes
Alignment Check and Padding: Initially, the alignment of the attention mask is checked. If misalignment is detected, padding is applied, followed by slicing. During this process, a warning is raised to alert users.

Should this be warn_once?

We only call expand once, on the aligned mask.

Reference
https://github.com/facebookresearch/xformers/blob/main/xformers/ops/fmha/cutlass.py#L115

@albanD, @mruberry, @jbschlosser, @walterddr, and @mikaylagawarecki.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114173
Approved by: https://github.com/danthe3rd
2023-11-22 20:02:51 +00:00
Tomasz Bohutyn
84909fef52 Add meta registration for aten.linear_backward (#114359)
Fixes #114358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114359
Approved by: https://github.com/ezyang
2023-11-22 18:24:24 +00:00
Isuru Fernando
4b7f9fa436 Meta register all foreach ops (#112281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112281
Approved by: https://github.com/lezcano
2023-11-21 14:23:09 +00:00
vfdev-5
1f8d00c5a3 [inductor] Added decomposition for upsample_nearest_exact Nd (#113749)
Description:
- Added decomposition for upsample_nearest_exact: 1d, 2d, 3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113749
Approved by: https://github.com/lezcano
2023-11-21 13:03:47 +00:00
lezcano
1d96034816 [BE][easy] Simplify the registration of a few metafunctions (#113635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113635
Approved by: https://github.com/Skylion007
ghstack dependencies: #113634, #113674
2023-11-16 19:09:12 +00:00
lezcano
9b3e694f5d Fix metafunction for many pointwise operations (#113634)
The previous metafunction was completely broken.
It incorrectly used a metafunction that was designed for prims. It also
passed in an incorrect enum class for the type promotion.
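
For reference, a couple of the elementwise promotions a correct pointwise metafunction must reproduce (illustrative):

```
import torch

# Standard elementwise type promotion, which the fixed metafunctions follow:
print(torch.promote_types(torch.int32, torch.float16))     # torch.float16
print(torch.promote_types(torch.bfloat16, torch.float32))  # torch.float32
```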

Fixes https://github.com/pytorch/pytorch/issues/113119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113634
Approved by: https://github.com/peterbell10
2023-11-16 19:09:12 +00:00
drisspg
c46fc46dba expose mem-eff to autograd (#110495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110495
Approved by: https://github.com/jbschlosser
2023-11-13 17:47:40 +00:00
Edward Z. Yang
f49b8e9313 Register SymInt-aware meta function for mm out, symintify resize (#113202)
Fixes https://github.com/pytorch/pytorch/issues/112489

Fixes https://github.com/pytorch/pytorch/issues/112494

New OpInfo tests for out variants added, since these were not exercised previously.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113202
Approved by: https://github.com/albanD
2023-11-10 14:27:05 +00:00
jiayisun
63d65dd6cd Correct output shape of meta registration for qlinear_pointwise (#112390)
Corrected the output shape of the meta registration for qlinear_pointwise.
Because the weight of qlinear_pointwise is transposed during the qlinear weight-prepack process, its shape is (in_features, out_features).
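
A sketch of the corrected shape logic (the function name is hypothetical, not the actual meta registration):

```
import torch

def qlinear_pointwise_meta_shape(x: torch.Tensor, packed_w: torch.Tensor) -> torch.Tensor:
    # packed_w is stored transposed: (in_features, out_features)
    assert x.shape[-1] == packed_w.shape[0], "in_features must match"
    return x.new_empty((*x.shape[:-1], packed_w.shape[1]))
```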

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112390
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2023-11-10 07:50:59 +00:00
eellison
325e0fdfdd Enable masked_scatter_backward for inductor (#109642)
masked_scatter_backward was previously implemented as a
CompositeExplicitAutograd, which involved a decomp that calls
masked_select, and masked_select in general produces data-dependent
shapes that inductor doesn't support. But masked_scatter_backward
reshapes the return value of masked_select such that the end result has
a static shape again.

I have converted masked_scatter_backward into an aten op to avoid this
issue.
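
Illustrative of the underlying problem (a sketch):

```
import torch

# masked_select's output length depends on the mask's values, which inductor
# cannot plan for statically; the backward then reshapes that result back to
# a static shape. Packaging the whole thing as one aten op keeps the
# data-dependent intermediate internal to the kernel.
x = torch.randn(4, 4)
mask = torch.rand(4, 4) > 0.5
selected = torch.masked_select(x, mask)  # 1-D; numel depends on mask contents
```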

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
2023-11-09 01:27:57 +00:00
Aaron Gokaslan
376217cc0b [BE]: Apply FURB145 to make code more readable and idiomatic. (#112990)
Testing out some new rules that are in beta; I think I will apply this one codebase-wide once it's out of preview. Replaces the hack of using `[:]` to copy a list with the proper `copy()` method. More efficient and more readable.
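
A minimal before/after for the rule:

```
# FURB145: prefer the explicit copy method over a full slice.
items = [1, 2, 3]
copied = items[:]      # before: slice-copy hack
copied = items.copy()  # after: explicit and idiomatic
```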
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112990
Approved by: https://github.com/ezyang
2023-11-06 13:15:04 +00:00
leslie-fang-intel
a53d29cc18 Enable oneDNN QLinear FP32/BF16 output (#112126)
**Summary**
- PR 2 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QLinear (relu) with BFloat16 or Float32 output.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qlinear_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112126
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
ghstack dependencies: #112010
2023-11-03 08:20:54 +00:00
leslie-fang-intel
b6fc7af8a0 Enable oneDNN QConv FP32/BF16 output (#112010)
**Summary**

- PR 1 for enabling Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor https://github.com/pytorch/pytorch/issues/111640.
- Enable QConv (relu, add, add_relu) with BFloat16 or Float32 output.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_pt2e
python -u -m pytest test_quantized_op.py -k test_qconv2d_add_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112010
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2023-11-03 08:16:45 +00:00
drisspg
458e7d09fd Add meta func for scaled mm (#112609)
# Summary
Adds a meta implementation for _scaled_mm, which is required for dynamic shapes.
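
A hedged sketch of the contract such a meta kernel must satisfy (illustrative name and signature, not the actual registration): compute output shape and dtype from metadata only, so tracing with dynamic shapes never touches data.

```
import torch

def scaled_mm_meta(a: torch.Tensor, b: torch.Tensor, out_dtype=None) -> torch.Tensor:
    assert a.dim() == 2 and b.dim() == 2 and a.shape[1] == b.shape[0]
    return a.new_empty((a.shape[0], b.shape[1]), dtype=out_dtype or a.dtype)
```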

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609
Approved by: https://github.com/eellison, https://github.com/malfet
2023-11-03 03:44:22 +00:00
PyTorch MergeBot
2e29172942 Revert "Add meta func for scaled mm (#112609)"
This reverts commit 75174c3797.

Reverted https://github.com/pytorch/pytorch/pull/112609 on behalf of https://github.com/huydhn due to Sorry for reverting this change, but it is failing ROCm jobs 75174c3797 ([comment](https://github.com/pytorch/pytorch/pull/112609#issuecomment-1791704037))
2023-11-02 23:37:16 +00:00
drisspg
75174c3797 Add meta func for scaled mm (#112609)
# Summary
Adds a meta implementation for _scaled_mm, which is required for dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112609
Approved by: https://github.com/eellison, https://github.com/malfet
2023-11-02 18:42:41 +00:00
Peter Bell
04024926f4 Use pytree.tree_map_ everywhere (#112417)
Wherever we discard the output of `tree_map`, it's better to call `tree_map_`,
which doesn't unflatten the mapped results and so is a lot cheaper.
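
A small illustration (`torch.utils._pytree` is an internal module, so this is a sketch of the pattern rather than a supported API):

```
import torch.utils._pytree as pytree

state = {"a": [1, 2], "b": (3,)}
pytree.tree_map_(print, state)                     # side effects only; no unflatten
doubled = pytree.tree_map(lambda x: 2 * x, state)  # rebuilds the pytree structure
```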
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112417
Approved by: https://github.com/lezcano
ghstack dependencies: #112391, #112392, #112393, #112394
2023-10-31 15:57:06 +00:00
lezcano
c8a5bb451e Do not import sympy within torch._prims_common (#112034)
This is the first of a few PRs that avoid importing SymPy at import time.
The pitch here is that we (almost!) do not have SymPy on our API, so
this should be feasible.

This should speed up torch imports by a good 15% as per
https://dev-discuss.pytorch.org/t/delving-into-what-happens-when-you-import-torch/1589

In this PR we just move a few global imports into local imports.
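
The pattern, sketched on a hypothetical helper:

```
def _to_sympy(expr):
    import sympy  # local import: deferred until symbolic math is actually used
    return sympy.sympify(expr)
```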
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112034
Approved by: https://github.com/ezyang
2023-10-26 12:53:25 +00:00
Jez Ng
ad3572a5dc Unify torch.SymInt and torch.types.SymInt (#110573)
Per @ezyang, this should be fine

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110573
Approved by: https://github.com/ezyang
2023-10-24 16:17:23 +00:00
Yuanjing Shi
920c9adcc6 [MetaTensor] fix inplace copy for meta tensor (#111705)
Fixes #105685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111705
Approved by: https://github.com/ezyang
2023-10-21 06:02:37 +00:00
Jane Xu
93a9b1314b Make step() faster by passing in a tensor vs scalar 1 (#111084)
This is the culminating result of https://github.com/pytorch/pytorch/pull/110954#issuecomment-1758520411.

We are making the code slightly more complicated to gain some perf by minimizing calls to `.copy_()` and `.to()`.

### Code
```
import torch
with torch.cuda.device(0):
    steps = [torch.zeros((), device="cpu", dtype=torch.float32) for i in range(1000)]

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ]
    ) as p:
        # New code:
        # step_device = steps[0].device
        # one = torch.tensor(1.0, device=step_device) if str(step_device) == "cpu" else 1
        # torch._foreach_add_(steps, one, 1.0)

        # Old code:
        torch._foreach_add_(steps, 1)

    print(p.key_averages().table(sort_by="cpu_time_total"))
```

### Profiles
**with old code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        35.31%      52.089ms        99.99%     147.495ms     147.495ms             1
               aten::add_        25.05%      36.949ms        64.68%      95.406ms      95.406us          1000
                 aten::to         3.97%       5.852ms        39.63%      58.457ms      58.457us          1000
           aten::_to_copy        10.11%      14.917ms        35.66%      52.605ms      52.605us          1000
              aten::copy_        21.65%      31.939ms        21.65%      31.939ms      31.939us          1000
      aten::empty_strided         3.90%       5.749ms         3.90%       5.749ms       5.749us          1000
    cudaDeviceSynchronize         0.01%      18.000us         0.01%      18.000us      18.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 147.513ms
```

**with new code**
```
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
      aten::_foreach_add_        55.06%      49.963ms        99.86%      90.625ms      90.625ms             1
               aten::add_        44.81%      40.662ms        44.81%      40.662ms      40.662us          1000
            aten::detach_         0.01%       8.000us         0.05%      45.000us      45.000us             1
                  detach_         0.04%      37.000us         0.04%      37.000us      37.000us             1
              aten::empty         0.03%      30.000us         0.03%      30.000us      30.000us             1
                 aten::to         0.03%      23.000us         0.03%      23.000us      23.000us             1
    cudaDeviceSynchronize         0.02%      22.000us         0.02%      22.000us      22.000us             1
         aten::lift_fresh         0.01%       6.000us         0.01%       6.000us       6.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 90.751ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111084
Approved by: https://github.com/albanD
ghstack dependencies: #111079
2023-10-20 01:34:08 +00:00
Scruel Tao
108378e2af Fix: torch.matrix_exp performance issue (#105225) (#110848)
Fixes #105225

- New implementation for `compute_T18_scale_square` method.
- Always use the highest degree for large batch sizes (size > 1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110848
Approved by: https://github.com/lezcano
2023-10-18 04:43:25 +00:00
Yanbo Liang
29048be41c [Reland] Add int4mm kernel (#111403)
This is a reland for #110914, #111327 and #111390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111403
Approved by: https://github.com/Chillee
2023-10-17 06:33:18 +00:00
PyTorch MergeBot
408e991dfe Revert "Quant: add weight int4pack mm kernel (#110914)"
This reverts commit 9980876cab.

Reverted https://github.com/pytorch/pytorch/pull/110914 on behalf of https://github.com/jeanschmidt due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110914#issuecomment-1765302621))
2023-10-16 21:27:26 +00:00
Brian Hirsh
0d368f586a fix wrong meta for index_select.out (#111364)
fixes https://github.com/pytorch/pytorch/issues/110699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111364
Approved by: https://github.com/ezyang
ghstack dependencies: #111040
2023-10-16 15:18:20 +00:00
Yanbo Liang
9980876cab Quant: add weight int4pack mm kernel (#110914)
Adding the weight int4pack mm CUDA kernel. The kernel comes from the tinygemm project, which was developed by Jeff Johnson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110914
Approved by: https://github.com/Chillee
2023-10-13 01:21:18 +00:00
drisspg
e0dbaa04d2 Fix the meta func for mem_eff_backward (#110893)
Fixes #110832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893
Approved by: https://github.com/eellison
2023-10-11 02:58:54 +00:00
Jon Chuang
37afa0c349 fix(inductor): Increase coverage of Inductor ATen lowering (#110473)
Add sqrt to the decomp testing path and fix missing `minimum`, `clamp_min`, `clamp_max` lowerings and/or registrations.

Follow up to: https://github.com/pytorch/pytorch/pull/110468#issuecomment-1745718602 (requires upstream to merge to avoid merge conflict)

CC: @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110473
Approved by: https://github.com/janeyx99
2023-10-04 23:40:46 +00:00
Jon Chuang
3fd938369f add foreach_abs meta registration and inductor decomp (#110468)
Fixes https://github.com/pytorch/pytorch/issues/110458

Somehow it is on the allowlist but not on the testing path.

CC @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110468
Approved by: https://github.com/janeyx99
2023-10-04 06:09:37 +00:00
Mwiza Kunda
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but have aten op overloads that do. Additionally, Python decompositions may register `OpOverloadPacket`s, so decompositions need to be tested to ensure all `OpOverload`s still function for the `Meta` key (e.g., if a Python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the Python function needs to support receiving out arguments).

- Add out parameter wrappers to Python decomps for aten ops that have out overloads; a sketch of the wrapper pattern follows below.
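
A hypothetical sketch of that wrapper pattern (the name and exact semantics are assumptions, not the actual decorator in torch/_decomp):

```
import torch

def with_out_support(decomp):
    # Adapt a functional Python decomposition so the .out overload also
    # works (including under the Meta key).
    def wrapper(*args, out=None, **kwargs):
        result = decomp(*args, **kwargs)
        if out is None:
            return result
        out.resize_(result.shape)
        out.copy_(result)
        return out
    return wrapper
```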

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
Mwiza Kunda
83b4aab5bc Allow zero sized tensors to be resized with meta_randperm (#109721)
Failure will be handled by `_maybe_resize_out`
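
Illustrative of the fixed behavior (a sketch):

```
import torch

# A zero-sized `out` is now resized by _maybe_resize_out rather than
# tripping the meta kernel's shape check.
out = torch.empty(0, dtype=torch.int64, device="meta")
torch.randperm(5, out=out)
print(out.shape)  # torch.Size([5])
```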

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109721
Approved by: https://github.com/ezyang
2023-09-21 18:41:29 +00:00
eellison
d24ba7a634 Add 3d Attn Pattern to match HF Whisper (#109156)
Adds a 3d pattern that improves perf of HF Whisper from 1.3x -> 4.1x. We could be matching more generally on 3d, but I'll leave that for another PR.

Thanks to @drisspg for helping me write the pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142
2023-09-20 16:39:31 +00:00
eellison
ad53b53518 Generate patterns in fp16 and fp32 (#109142)
aten.softmax will generate a different decomposition for fp16/bf16 than for fp32, because when invoked in lower precision it upcasts the inputs to fp32 and downcasts after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).
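
Roughly what the low-precision decomposition looks like (a sketch, not the exact decomp source):

```
import torch

def softmax_lowp_decomp(x: torch.Tensor, dim: int) -> torch.Tensor:
    y = x.float()           # upcast inserted for fp16/bf16 inputs
    out = torch.softmax(y, dim)
    return out.to(x.dtype)  # downcast back; fp32 graphs lack these two casts
```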

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917
2023-09-20 06:38:02 +00:00
PyTorch MergeBot
c2f5d4d8f0 Revert "Generate patterns in fp16 and fp32 (#109142)"
This reverts commit 14994cc978.

Reverted https://github.com/pytorch/pytorch/pull/109142 on behalf of https://github.com/eellison due to MESSAGE ([comment](https://github.com/pytorch/pytorch/pull/109142#issuecomment-1726641232))
2023-09-19 22:52:05 +00:00
eellison
14994cc978 Generate patterns in fp16 and fp32 (#109142)
aten.softmax will generate a different decomposition for fp16/bf16 than for fp32, because when invoked in lower precision it upcasts the inputs to fp32 and downcasts after. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #108894, #108917
2023-09-19 20:59:42 +00:00
leslie-fang-intel
4a60bd22b2 [Quant][Inductor] Enable quantization dynamic batch size support (#108550)
**Summary**
This diff enables dynamic batch size support for the quantization use case in Inductor. Taking the UT in this PR as an example, after this PR the generated code assumes a dynamic input batch size.
```
cpp_fused_quantize_per_tensor_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const float* in_ptr0,
                       unsigned char* out_ptr0,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(static_cast<long>(ks1*ks1)); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i2 + (i1*(static_cast<long>(ks1*ks1))) + (3L*i0*(static_cast<long>(ks1*ks1))))];
                    auto tmp1 = static_cast<float>(40.36037717834931);
                    auto tmp2 = decltype(tmp0)(tmp0 * tmp1);
                    auto tmp3 = std::nearbyint(tmp2);
                    auto tmp4 = static_cast<float>(97.0);
                    auto tmp5 = tmp3 + tmp4;
                    auto tmp6 = static_cast<float>(0.0);
                    auto tmp7 = max_propagate_nan(tmp5, tmp6);
                    auto tmp8 = static_cast<float>(255.0);
                    auto tmp9 = min_propagate_nan(tmp7, tmp8);
                    auto tmp10 = static_cast<unsigned char>(tmp9);
                    out_ptr0[static_cast<long>(i1 + (3L*i2) + (3L*i0*(static_cast<long>(ks1*ks1))))] = tmp10;
                }
            }
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       unsigned char* out_ptr1,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(16L))
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={at::vec::Vectorized<float>(0)})
                    float tmp_acc0 = 0;
                    at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i1 + (16L*i0) + (16L*i2) + (16L*i0*(static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L))))) + (32L*i0*(at::native::div_floor_integer(ks1, 2L)))));
                        auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                        auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                        auto tmp3 = tmp1 - tmp2;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.010429476387798786));
                        auto tmp5 = tmp3 * tmp4;
                        tmp_acc0_vec = tmp_acc0_vec + tmp5;
                    }
                    tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (16L*i0)));
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            auto tmp0 = out_ptr0[static_cast<long>(i0)];
            auto tmp1 = static_cast<float>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L))));
            auto tmp2 = tmp0 / tmp1;
            auto tmp3 = static_cast<float>(168.09128392896545);
            auto tmp4 = decltype(tmp2)(tmp2 * tmp3);
            auto tmp5 = std::nearbyint(tmp4);
            auto tmp6 = static_cast<float>(0.0);
            auto tmp7 = tmp5 + tmp6;
            auto tmp8 = max_propagate_nan(tmp7, tmp6);
            auto tmp9 = static_cast<float>(255.0);
            auto tmp10 = min_propagate_nan(tmp8, tmp9);
            auto tmp11 = static_cast<unsigned char>(tmp10);
            out_ptr1[static_cast<long>(i0)] = tmp11;
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_2 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       const long ks0)
{
    {
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i0));
            auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
            auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
            auto tmp3 = tmp1 - tmp2;
            auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.0056716203689575195));
            auto tmp5 = tmp3 * tmp4;
            tmp5.store(out_ptr0 + static_cast<long>(i0));
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg8_1, arg9_1, arg10_1 = args
    args.clear()
    s0 = arg8_1
    s2 = arg9_1
    assert_size_stride(arg10_1, (s0, 3, s2, s2), (3*(s2*s2), s2*s2, s2, 1))
    buf0 = empty_strided((s0, 3, s2, s2), (3*(s2*s2), 1, 3*s2, 3), device='cpu', dtype=torch.uint8)
    cpp_fused_quantize_per_tensor_0(c_void_p(arg10_1.data_ptr()), c_void_p(buf0.data_ptr()), c_long(s0), c_long(s2))
    del arg10_1
    buf1 = torch.ops.onednn.qconv2d_pointwise(buf0, 0.024776775389909744, 97, constant5, constant2, constant3, constant0, [1, 1], [1, 1], [1, 1], 1, 95.88209060714476, 0, False, 'relu', [], '')
    assert_size_stride(buf1, (s0, 16, 1 + s2, 1 + s2), (16 + (16*(s2*s2)) + (32*s2), 1, 16 + (16*s2), 16))
    del buf0
    # Source Nodes: [quantize_per_tensor_default_2], Original ATen: [quantized_decomposed.quantize_per_tensor]
    buf2 = torch.ops.quantized.max_pool2d(buf1, [3, 3], [2, 2], [1, 1], [1, 1], False)
    del buf1
    buf3 = buf2
    assert_size_stride(buf3, (s0, 16, 1 + (s2 // 2), 1 + (s2 // 2)), (16 + (16*((s2 // 2)*(s2 // 2))) + (32*(s2 // 2)), 1, 16 + (16*(s2 // 2)), 16))
    del buf2
    buf4 = empty_strided((s0, 16, 1, 1), (16, 1, 16*s0, 16*s0), device='cpu', dtype=torch.float32)
    buf5 = empty_strided((s0, 16), (16, 1), device='cpu', dtype=torch.uint8)
    cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1(c_void_p(buf3.data_ptr()), c_void_p(buf4.data_ptr()), c_void_p(buf5.data_ptr()), c_long(s0), c_long(s2))
    del buf3
    buf6 = torch.ops.onednn.qlinear_pointwise(buf5, 0.005949148442596197, 0, constant6, constant4, constant3, constant1, 176.31645543014483, 100, False, 'none', [], '')
    assert_size_stride(buf6, (s0, 16), (16, 1))
    del buf5
    buf7 = reinterpret_tensor(buf4, (s0, 16), (16, 1)); del buf4  # reuse
    cpp_fused_dequantize_per_tensor_2(c_void_p(buf6.data_ptr()), c_void_p(buf7.data_ptr()), c_long(s0))
    return (buf7, )

```

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_maxpool2d_linear_dynamic
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108550
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-19 08:30:16 +00:00
Jez Ng
7f3885137f Add meta function for _segment_reduce (#109359)
This fixes numerous tests that were xfailing. For instance, the
`_segment_reduce.lengths` OpInfo test previously relied on
the fallback kernel to determine the shape of the meta tensor. The
fallback kernel would fail with

    segment_reduce(): Expected all rows of lengths along axis to sum to data.size(lengths.dim()-1) when !unsafe.

as it was trying to read the values of a meta tensor.
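
A sketch of the metadata-only computation a meta kernel performs instead (hypothetical helper, not the actual registration):

```
import torch

def segment_reduce_meta_shape(data: torch.Tensor, lengths: torch.Tensor, axis: int = 0) -> torch.Tensor:
    # Only shapes are consulted, never values: the output has one entry per
    # segment along `axis`, and the segment count is lengths' last dimension.
    out_shape = list(data.shape)
    out_shape[axis] = lengths.shape[-1]
    return data.new_empty(out_shape)
```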

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109359
Approved by: https://github.com/ezyang
2023-09-16 13:31:03 +00:00
PyTorch MergeBot
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula. Check the comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
drisspg
ad90ab31f2 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files changed are in the flash_attn/ folder. The only files of interest here, IMO, are:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are on an A100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep, and for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-13 13:59:05 +00:00
Jez Ng
063a62622b Add memory overlap check to meta_copy_ (#108989)
Fixes `test_copy_many_to_one`.
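
Illustrative of the kind of overlap now rejected (a sketch):

```
import torch

dst = torch.empty(10, device="meta").expand(3, 10)  # stride-0 dim: rows alias
src = torch.empty(3, 10, device="meta")
# dst.copy_(src)  # now raises: multiple elements refer to the same location
```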
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989
Approved by: https://github.com/eellison
2023-09-12 23:28:14 +00:00
Peter Bell
464f9c3725 [meta] Add meta implementation for aten.masked_scatter (#108802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108802
Approved by: https://github.com/lezcano
2023-09-12 16:16:05 +00:00
Li-Huai (Allan) Lin
b2cba439b4 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 23:30:40 +00:00