Commit Graph

326 Commits

drisspg
e0dbaa04d2 Fix the meta func for mem_eff_backward (#110893)
Fixes #110832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110893
Approved by: https://github.com/eellison
2023-10-11 02:58:54 +00:00
Jon Chuang
37afa0c349 fix(inductor): Increase coverage of Inductor ATen lowering (#110473)
Add `sqrt` to the decomp testing path and fix missing `minimum`, `clamp_min`, and `clamp_max` lowerings and/or registrations.
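
A minimal sketch (not the PR's test) that exercises these lowerings under Inductor:
```
import torch

def fn(x, y):
    # hits the sqrt, minimum, clamp_min, and clamp_max lowerings
    return torch.sqrt(torch.minimum(x.clamp_min(0.1), y.clamp_max(2.0)))

x, y = torch.rand(8), torch.rand(8)
torch.testing.assert_close(torch.compile(fn, backend="inductor")(x, y), fn(x, y))
```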

Follow-up to: https://github.com/pytorch/pytorch/pull/110468#issuecomment-1745718602 (requires upstream to merge first to avoid a merge conflict)

CC: @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110473
Approved by: https://github.com/janeyx99
2023-10-04 23:40:46 +00:00
Jon Chuang
3fd938369f add foreach_abs meta registration and inductor decomp (#110468)
Fixes https://github.com/pytorch/pytorch/issues/110458

Somehow it is on the allowlist but not on the testing path.
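
A minimal check of the kind now on the testing path (a sketch, not the PR's actual test):
```
import torch

xs = [torch.randn(4), torch.randn(2, 3)]
eager = torch._foreach_abs(xs)
compiled = torch.compile(lambda ts: torch._foreach_abs(ts), backend="inductor")(xs)
for a, b in zip(eager, compiled):
    torch.testing.assert_close(a, b)
```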

CC @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110468
Approved by: https://github.com/janeyx99
2023-10-04 06:09:37 +00:00
Mwiza Kunda
5c4b5baf21 Fix python decomps for OpOverloadPackets and add tests (#107707)
- Extend `test_torch_dispatch_meta_outplace` to test torch ops that do not have an out parameter but whose aten ops have overloads that do. Additionally, Python decompositions may register `OpOverloadPacket`s, so decompositions need to be tested to ensure all `OpOverload`s still function for the `Meta` key (e.g., if a Python decomposition is registered for an aten op `aten.foo` with overloads `[default, out]`, the Python function needs to support receiving out arguments).

- Add out-parameter wrappers to Python decomps for aten ops that have out overloads (a sketch of the idea follows below)
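
A hand-rolled sketch of the out-wrapper idea (the `with_out_support` helper below is hypothetical and only illustrates the behavior; it is not the wrapper added in this PR):
```
import torch

def with_out_support(decomp):
    # Hypothetical helper: lets a functional decomposition also serve the
    # op's `out=` overload by resizing and copying into the provided tensor.
    def wrapper(*args, out=None, **kwargs):
        result = decomp(*args, **kwargs)
        if out is None:
            return result
        out.resize_(result.shape)
        return out.copy_(result)
    return wrapper

@with_out_support
def add_decomp(a, b):
    return a + b

buf = torch.empty(0)
add_decomp(torch.ones(3), torch.full((3,), 2.0), out=buf)  # buf -> [3., 3., 3.]
```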

CC. @ezyang @albanD @lezcano

Fixes #107713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107707
Approved by: https://github.com/lezcano
2023-09-25 20:53:30 +00:00
Mwiza Kunda
83b4aab5bc Allow zero sized tensors to be resized with meta_randperm (#109721)
Resizing failures will be handled by `_maybe_resize_out`.
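
A sketch of the case this enables, assuming a build with this change:
```
import torch

out = torch.empty(0, dtype=torch.int64, device="meta")
torch.randperm(5, out=out)  # zero-sized meta out tensor is now resized instead of rejected
print(out.shape)            # torch.Size([5])
```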

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109721
Approved by: https://github.com/ezyang
2023-09-21 18:41:29 +00:00
eellison
d24ba7a634 Add 3d Attn Pattern to match HF Whisper (#109156)
Adds a 3d attention pattern that improves HF Whisper perf from 1.3x to 4.1x. We could match more generally on 3d inputs, but I'll leave that for another PR.

Thanks to @drisspg for helping me write the pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109156
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917, #109142
2023-09-20 16:39:31 +00:00
eellison
ad53b53518 Generate patterns in fp16 and fp32 (#109142)
aten.softmax generates a different decomposition for fp16/bf16 than for fp32 because, when invoked in lower precision, it upcasts the inputs to fp32 and downcasts afterwards. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).
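
A sketch of why the traced graphs differ: in reduced precision, the decomposition upcasts to fp32 and downcasts the result, so a pattern traced in fp32 no longer matches.
```
import torch

x = torch.randn(4, 8, dtype=torch.bfloat16)
# roughly what the low-precision decomposition emits: upcast, softmax, downcast
decomp_like = torch.softmax(x.float(), dim=-1).to(torch.bfloat16)
torch.testing.assert_close(torch.softmax(x, dim=-1), decomp_like)
```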

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #109663, #108894, #108917
2023-09-20 06:38:02 +00:00
PyTorch MergeBot
c2f5d4d8f0 Revert "Generate patterns in fp16 and fp32 (#109142)"
This reverts commit 14994cc978.

Reverted https://github.com/pytorch/pytorch/pull/109142 on behalf of https://github.com/eellison due to MESSAGE ([comment](https://github.com/pytorch/pytorch/pull/109142#issuecomment-1726641232))
2023-09-19 22:52:05 +00:00
eellison
14994cc978 Generate patterns in fp16 and fp32 (#109142)
aten.softmax generates a different decomposition for fp16/bf16 than for fp32 because, when invoked in lower precision, it upcasts the inputs to fp32 and downcasts afterwards. This has been causing us to miss bf16 patterns. For example, Camembert improves 20% with this PR (as, I'm sure, do many other models).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109142
Approved by: https://github.com/yanboliang
ghstack dependencies: #108894, #108917
2023-09-19 20:59:42 +00:00
leslie-fang-intel
4a60bd22b2 [Quant][Inductor] Enable quantization dynamic batch size support (#108550)
**Summary**
This diff enables dynamic batch size support for the quantization use case in Inductor. Taking the UT in this PR as an example, after this PR the generated code assumes a dynamic input batch size.
```
cpp_fused_quantize_per_tensor_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const float* in_ptr0,
                       unsigned char* out_ptr0,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(static_cast<long>(ks1*ks1)); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i2 + (i1*(static_cast<long>(ks1*ks1))) + (3L*i0*(static_cast<long>(ks1*ks1))))];
                    auto tmp1 = static_cast<float>(40.36037717834931);
                    auto tmp2 = decltype(tmp0)(tmp0 * tmp1);
                    auto tmp3 = std::nearbyint(tmp2);
                    auto tmp4 = static_cast<float>(97.0);
                    auto tmp5 = tmp3 + tmp4;
                    auto tmp6 = static_cast<float>(0.0);
                    auto tmp7 = max_propagate_nan(tmp5, tmp6);
                    auto tmp8 = static_cast<float>(255.0);
                    auto tmp9 = min_propagate_nan(tmp7, tmp8);
                    auto tmp10 = static_cast<unsigned char>(tmp9);
                    out_ptr0[static_cast<long>(i1 + (3L*i2) + (3L*i0*(static_cast<long>(ks1*ks1))))] = tmp10;
                }
            }
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       unsigned char* out_ptr1,
                       const long ks0,
                       const long ks1)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
        {
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(16L))
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={at::vec::Vectorized<float>(0)})
                    float tmp_acc0 = 0;
                    at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L)))); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i1 + (16L*i0) + (16L*i2) + (16L*i0*(static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L))))) + (32L*i0*(at::native::div_floor_integer(ks1, 2L)))));
                        auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                        auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                        auto tmp3 = tmp1 - tmp2;
                        auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.010429476387798786));
                        auto tmp5 = tmp3 * tmp4;
                        tmp_acc0_vec = tmp_acc0_vec + tmp5;
                    }
                    tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (16L*i0)));
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            auto tmp0 = out_ptr0[static_cast<long>(i0)];
            auto tmp1 = static_cast<float>(1L + (static_cast<long>((at::native::div_floor_integer(ks1, 2L))*(at::native::div_floor_integer(ks1, 2L)))) + (2L*(at::native::div_floor_integer(ks1, 2L))));
            auto tmp2 = tmp0 / tmp1;
            auto tmp3 = static_cast<float>(168.09128392896545);
            auto tmp4 = decltype(tmp2)(tmp2 * tmp3);
            auto tmp5 = std::nearbyint(tmp4);
            auto tmp6 = static_cast<float>(0.0);
            auto tmp7 = tmp5 + tmp6;
            auto tmp8 = max_propagate_nan(tmp7, tmp6);
            auto tmp9 = static_cast<float>(255.0);
            auto tmp10 = min_propagate_nan(tmp8, tmp9);
            auto tmp11 = static_cast<unsigned char>(tmp10);
            out_ptr1[static_cast<long>(i0)] = tmp11;
        }
    }
}
''')

cpp_fused_dequantize_per_tensor_2 = async_compile.cpp('''
#include "/tmp/torchinductor_root/ib/cibrnuq56cxamjj4krp4zpjvsirbmlolpbnmomodzyd46huzhdw7.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0,
                       const long ks0)
{
    {
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<uint8_t>::loadu_one_fourth(in_ptr0 + static_cast<long>(i0));
            auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
            auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
            auto tmp3 = tmp1 - tmp2;
            auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.0056716203689575195));
            auto tmp5 = tmp3 * tmp4;
            tmp5.store(out_ptr0 + static_cast<long>(i0));
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg8_1, arg9_1, arg10_1 = args
    args.clear()
    s0 = arg8_1
    s2 = arg9_1
    assert_size_stride(arg10_1, (s0, 3, s2, s2), (3*(s2*s2), s2*s2, s2, 1))
    buf0 = empty_strided((s0, 3, s2, s2), (3*(s2*s2), 1, 3*s2, 3), device='cpu', dtype=torch.uint8)
    cpp_fused_quantize_per_tensor_0(c_void_p(arg10_1.data_ptr()), c_void_p(buf0.data_ptr()), c_long(s0), c_long(s2))
    del arg10_1
    buf1 = torch.ops.onednn.qconv2d_pointwise(buf0, 0.024776775389909744, 97, constant5, constant2, constant3, constant0, [1, 1], [1, 1], [1, 1], 1, 95.88209060714476, 0, False, 'relu', [], '')
    assert_size_stride(buf1, (s0, 16, 1 + s2, 1 + s2), (16 + (16*(s2*s2)) + (32*s2), 1, 16 + (16*s2), 16))
    del buf0
    # Source Nodes: [quantize_per_tensor_default_2], Original ATen: [quantized_decomposed.quantize_per_tensor]
    buf2 = torch.ops.quantized.max_pool2d(buf1, [3, 3], [2, 2], [1, 1], [1, 1], False)
    del buf1
    buf3 = buf2
    assert_size_stride(buf3, (s0, 16, 1 + (s2 // 2), 1 + (s2 // 2)), (16 + (16*((s2 // 2)*(s2 // 2))) + (32*(s2 // 2)), 1, 16 + (16*(s2 // 2)), 16))
    del buf2
    buf4 = empty_strided((s0, 16, 1, 1), (16, 1, 16*s0, 16*s0), device='cpu', dtype=torch.float32)
    buf5 = empty_strided((s0, 16), (16, 1), device='cpu', dtype=torch.uint8)
    cpp_fused_dequantize_per_tensor_mean_quantize_per_tensor_1(c_void_p(buf3.data_ptr()), c_void_p(buf4.data_ptr()), c_void_p(buf5.data_ptr()), c_long(s0), c_long(s2))
    del buf3
    buf6 = torch.ops.onednn.qlinear_pointwise(buf5, 0.005949148442596197, 0, constant6, constant4, constant3, constant1, 176.31645543014483, 100, False, 'none', [], '')
    assert_size_stride(buf6, (s0, 16), (16, 1))
    del buf5
    buf7 = reinterpret_tensor(buf4, (s0, 16), (16, 1)); del buf4  # reuse
    cpp_fused_dequantize_per_tensor_2(c_void_p(buf6.data_ptr()), c_void_p(buf7.data_ptr()), c_long(s0))
    return (buf7, )

```

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_maxpool2d_linear_dynamic
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108550
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-19 08:30:16 +00:00
Jez Ng
7f3885137f Add meta function for _segment_reduce (#109359)
This fixes numerous tests that were xfailing - for instance, the
`_segment_reduce.lengths` OpInfo test, which previously relied on
the fallback kernel to determine the shape of the meta tensor. The
fallback kernel would fail with

    segment_reduce(): Expected all rows of lengths along axis to sum to data.size(lengths.dim()-1) when !unsafe.

as it was trying to read the values of a meta tensor.
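
A sketch of a call that now shape-propagates without the fallback (the exact behavior of the `unsafe` flag on meta is an assumption):
```
import torch

data = torch.randn(4, 6, device="meta")
lengths = torch.empty(3, dtype=torch.int64, device="meta")
# unsafe=True skips the value check, which cannot be done on meta anyway
out = torch.segment_reduce(data, "sum", lengths=lengths, axis=0, unsafe=True)
print(out.shape)  # 3 segments along axis 0 -> torch.Size([3, 6])
```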

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109359
Approved by: https://github.com/ezyang
2023-09-16 13:31:03 +00:00
PyTorch MergeBot
be9f73f031 Revert "Add meta and OpInfo for _embedding_bag_dense_backward (#109211)"
This reverts commit fe14e43d14.

Reverted https://github.com/pytorch/pytorch/pull/109211 on behalf of https://github.com/clee2000 due to Sorry I think the test_ops.py::TestCommonCUDA::test_compare_cpu__embedding_bag_dense_backward_cuda_float32 is failing 492a93d185 https://github.com/pytorch/pytorch/actions/runs/6190707847/job/16808644559 not sure why this is run in slow when it looks to be a new test ([comment](https://github.com/pytorch/pytorch/pull/109211#issuecomment-1720235918))
2023-09-14 22:29:12 +00:00
Edward Z. Yang
fe14e43d14 Add meta and OpInfo for _embedding_bag_dense_backward (#109211)
The sample inputs are a bit involved because there are a lot of
shenanigans in the derivative formula. Check the comments.

This is exercised in vdd, internal test `buck2 run '@fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- 'pytorch.benchmark.fb.test_gpu.test_gpu.TestBenchmarkFbGpu.test_train_blue_reels_vdd_v3_inductor_speedup'`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109211
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-09-14 18:49:32 +00:00
drisspg
ad90ab31f2 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
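
A usage sketch of how the updated kernel is exercised (this uses the existing `torch.backends.cuda.sdp_kernel` backend-forcing context manager, not anything added here, and assumes an sm80+ GPU with fp16/bf16 inputs):
```
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```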

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update tests to exercise the above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are from an A100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep; for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-13 13:59:05 +00:00
Jez Ng
063a62622b Add memory overlap check to meta_copy_ (#108989)
Fixes `test_copy_many_to_one`.
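
A sketch of the overlap case this now rejects on meta (the exact error text is an assumption):
```
import torch

dst = torch.empty(1, device="meta").expand(4)  # many elements alias one memory location
src = torch.empty(4, device="meta")
try:
    dst.copy_(src)
except RuntimeError as e:
    print(e)  # overlap error, matching eager's check
```
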
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989
Approved by: https://github.com/eellison
2023-09-12 23:28:14 +00:00
Peter Bell
464f9c3725 [meta] Add meta implementation for aten.masked_scatter (#108802)
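
A shape-propagation sketch, assuming the new meta implementation is in place:
```
import torch

t = torch.zeros(2, 3, device="meta")
mask = torch.empty(2, 3, dtype=torch.bool, device="meta")
src = torch.empty(6, device="meta")
print(torch.masked_scatter(t, mask, src).shape)  # torch.Size([2, 3])
```
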
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108802
Approved by: https://github.com/lezcano
2023-09-12 16:16:05 +00:00
Li-Huai (Allan) Lin
b2cba439b4 Introduce Tensor overload to linspace and logspace (#104889)
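
A usage sketch of the new overload, with 0-dim Tensor start/end instead of Python scalars:
```
import torch

start, end = torch.tensor(0.0), torch.tensor(1.0)
print(torch.linspace(start, end, steps=5))
print(torch.logspace(start, end, steps=5))
```
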
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 23:30:40 +00:00
PyTorch MergeBot
a7f5abeade Revert "Introduce Tensor overload to linspace and logspace (#104889)"
This reverts commit 57e5239321.

Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))
2023-09-11 17:33:48 +00:00
Li-Huai (Allan) Lin
57e5239321 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 15:29:39 +00:00
Huy Do
a9c663c269 Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1c.

There are some conflicts in a benchmark CSV file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951, so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 07:43:04 +00:00
PyTorch MergeBot
e45b290127 Revert "Revert "Flash Attention v2 (#105602)" (#108827)"
This reverts commit 24e9bbe22a.

Reverted https://github.com/pytorch/pytorch/pull/108827 on behalf of https://github.com/huydhn due to I need to land this revert properly as there are new failures showing up on trunk ([comment](https://github.com/pytorch/pytorch/pull/108827#issuecomment-1711020924))
2023-09-08 03:25:45 +00:00
Huy Do
24e9bbe22a Revert "Flash Attention v2 (#105602)" (#108827)
This reverts commit add45aea1c.

There are some conflicts in a benchmark CSV file https://github.com/pytorch/pytorch/pull/105602#issuecomment-1710988951, so I need to revert this manually.

The diff has been reverted internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108827
Approved by: https://github.com/kit1980
2023-09-08 02:54:20 +00:00
Ken Jin
c458fa0d35 Decompose/add reference for view_as_complex (#108005)
Aten source: d4a99631dd/aten/src/ATen/native/ComplexHelper.h (L78)

Documentation reference:
https://pytorch.org/docs/stable/generated/torch.view_as_complex.html

Note: this adds a new primitive `view_of_dtype`, which is trivially implemented, as its meta function is already implemented elsewhere.

Finally, this is not registered as a decomposition (yet), because TorchInductor does not yet support complex types. It should be added once it does.

Closes https://github.com/pytorch/pytorch/issues/108020 as well.
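
A sketch of the semantics the reference implements (the real ref reinterprets memory via `view_of_dtype` rather than copying):
```
import torch

x = torch.randn(3, 2)          # last dim of size 2 holds (real, imag) pairs
z = torch.view_as_complex(x)
torch.testing.assert_close(z.real, x[..., 0])
torch.testing.assert_close(z.imag, x[..., 1])
```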

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108005
Approved by: https://github.com/peterbell10, https://github.com/ezyang
2023-09-07 23:49:20 +00:00
Michael Lazos
b193f295b6 Add capturable ASGD impl (#107857)
Add capturable ASGD impl + test
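
A usage sketch (the capturable path expects CUDA parameters; `capturable=True` is the flag this PR adds for ASGD):
```
import torch

params = [torch.nn.Parameter(torch.randn(4, device="cuda"))]
opt = torch.optim.ASGD(params, lr=1e-2, capturable=True)
params[0].grad = torch.randn_like(params[0])
opt.step()
```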

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107857
Approved by: https://github.com/janeyx99
2023-09-07 06:30:30 +00:00
drisspg
add45aea1c Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update tests to exercise the above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are from an A100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep; for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-09-01 22:14:44 +00:00
PyTorch MergeBot
d569e506ab Revert "Flash Attention v2 (#105602)"
This reverts commit 9df3d882c8.

Reverted https://github.com/pytorch/pytorch/pull/105602 on behalf of https://github.com/huydhn due to I think we miss a case here for sm80 build on inductor workflow as it is now OOM on trunk https://github.com/pytorch/pytorch/actions/runs/6042843139 ([comment](https://github.com/pytorch/pytorch/pull/105602#issuecomment-1701974862))
2023-09-01 01:15:01 +00:00
drisspg
9df3d882c8 Flash Attention v2 (#105602)
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985

### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao

### Changes Made
The majority of the changes in this pull request involve:

- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.

### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files-changed are in the flash_attn/ folder. The only files of interest here IMO:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py ( this has been incorporated upstream to flash-attention github)

There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.

### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108

### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update tests to exercise the above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes

### Results
#### Forward only
The TFLOPs reported here are from an A100 that is underclocked.
![flashv2_tflops_vs_seq_len](https://github.com/pytorch/pytorch/assets/32754868/152de46d-8fa6-42f0-9a9c-ef1eb7ae29e7)

#### Forward+Backward
Ran a sweep; for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
2023-08-31 16:02:20 +00:00
lezcano
239ee76177 Add refs/decomps for dot/vdot (#108194)
Follow-up on https://github.com/pytorch/pytorch/issues/108127#issuecomment-1698142427
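
A sketch of the semantics the refs/decomps implement:
```
import torch

a = torch.randn(5, dtype=torch.complex64)
b = torch.randn(5, dtype=torch.complex64)
torch.testing.assert_close(torch.dot(a, b), (a * b).sum())          # dot: plain elementwise product, summed
torch.testing.assert_close(torch.vdot(a, b), (a.conj() * b).sum())  # vdot: conjugates the first argument
```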

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108194
Approved by: https://github.com/peterbell10
ghstack dependencies: #108188
2023-08-31 15:30:23 +00:00
rzou
0e4752bafc Allow registering decomps for HigherOrderOp; add decomp for out_dtype (#108080)
We allow registering decomps for HigherOrderOp via the existing decomp
mechanisms:
- I refactored those APIs to accept torch._ops.OperatorBase, which is the base
  class for torch.ops.HigherOrderOperator and torch.ops.OpOverload
- HigherOrderOps must directly call maybe_handle_decomp in their
  ProxyTorchDispatchMode handling in order to resolve decompositions. We
  can change this in the future so that they do not need to do this.

Next, we add an Inductor decomp for out_dtype. This decomp shouldn't be
generally available because we want to preserve out_dtype for the backend
for other use cases (e.g. ExecuTorch).
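
A usage sketch of the op being decomposed; the import path (`torch._higher_order_ops.out_dtype`) is an assumption based on where the op lives around this time:
```
import torch
from torch._higher_order_ops.out_dtype import out_dtype

x = torch.randint(-128, 127, (4, 8), dtype=torch.int8)
w = torch.randint(-128, 127, (8, 16), dtype=torch.int8)
# int8 matmul whose output (and accumulation) dtype is promoted to int32
y = out_dtype(torch.ops.aten.mm.default, torch.int32, x, w)
print(y.dtype)  # torch.int32
```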

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108080
Approved by: https://github.com/HDCharles
2023-08-31 03:15:38 +00:00
Xia, Weiwen
15ceafb5c5 [Quant][Inductor] Enable qlinear weight prepack inside inductor constant folding (#106782)
**Summary**
To implement weight prepacking for quantized linear, we replace the following pattern
```
int8 activation
      |
dequant_per_tensor
      |
mm/addmm <- t <- dequant_per_channel <- int8_weight
```
with
```
int8 activation
  |
onednn.qlinear_pointwise <- onednn.qlinear_prepack <- int8_weight
```
And we register the weight prepack path inside Inductor constant folding. Constant folding evaluates the prepack op and replaces it with the prepacked weight (a constant parameter).

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_unary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106782
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
ghstack dependencies: #105818, #106781
2023-08-27 12:53:44 +00:00
leslie-fang-intel
25678e31dc [Quant][Inductor] Enable quantized conv weight prepack inside inductor constant folding (#104581)
**Summary**
Enable quantized conv weight prepacking inside Inductor constant folding.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104581
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580
2023-08-25 17:37:41 +00:00
Liao, Xuan
a46217d2ef [CPU] Enable fused_attention pattern matcher (#107128)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

Enable the SDPA graph rewriting for Inductor CPU.
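
A sketch of the rewrite being enabled: a hand-written attention compiled on CPU should now hit the SDPA pattern (the tolerances are arbitrary).
```
import torch

def attn(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(1, 4, 128, 32)
compiled = torch.compile(attn)
torch.testing.assert_close(compiled(q, k, v), attn(q, k, v), rtol=1e-4, atol=1e-4)
```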

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107128
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #104583, #104584, #103826, #104693, #104863
2023-08-20 08:53:24 +00:00
Masaki Kozuki
b234b94760 Add in-place _foreach_copy (#107226)
Fixes #107162
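
A usage sketch of the new in-place variant:
```
import torch

dsts = [torch.zeros(3), torch.zeros(2, 2)]
srcs = [torch.ones(3), torch.full((2, 2), 2.0)]
torch._foreach_copy_(dsts, srcs)  # copies each src into the matching dst in place
```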

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107226
Approved by: https://github.com/janeyx99
2023-08-17 00:11:18 +00:00
Tugsbayasgalan Manlaibaatar
20c5add133 [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, since the compiler will unconditionally assume min >= 2 anyway. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we would runtime-assert on the unbacked symint's value range, which was always [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range (a usage sketch follows below).
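
A usage sketch; the helper name `torch._constrain_as_size` and its keyword arguments are assumptions about the API of this era and have changed since:
```
import torch

def fn(x):
    n = x.item()                            # unbacked SymInt under export/compile
    torch._constrain_as_size(n, min=0, max=1024)
    return torch.zeros(n)
```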

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-15 05:41:43 +00:00
Nikita Karetnikov
e7a3fb13e7 [pt2] add Python metas for special ops (#106683)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106683
Approved by: https://github.com/ezyang
2023-08-13 14:12:21 +00:00
PyTorch MergeBot
354484ea6d Revert "Add _foreach_clamp (#106574)"
This reverts commit 2b560d3c3a.

Reverted https://github.com/pytorch/pytorch/pull/106574 on behalf of https://github.com/kit1980 due to breaking internal windows builds ([comment](https://github.com/pytorch/pytorch/pull/106574#issuecomment-1675400335))
2023-08-11 21:05:04 +00:00
PyTorch MergeBot
745d29b0cc Revert "[export] Refactor constrain_as_value and constrain_as_size (#106591)"
This reverts commit 18989890bf.

Reverted https://github.com/pytorch/pytorch/pull/106591 on behalf of https://github.com/izaitsevfb due to Breaks inductor test on trunk ([comment](https://github.com/pytorch/pytorch/pull/106591#issuecomment-1675069091))
2023-08-11 16:37:47 +00:00
Tugsbayasgalan Manlaibaatar
18989890bf [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, since the compiler will unconditionally assume min >= 2 anyway. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we would runtime-assert on the unbacked symint's value range, which was always [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-11 05:29:22 +00:00
David Berard
393e9eed90 [inductor] modify index_reduce to pass opinfo tests (#106429)
1. Add a Python meta registration to fix an issue with the forward pass. Previously, the C++ meta registration called [numel()](7b14a14e27/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L329)), which fails (LMK if it's better to fix the C++ implementation to not do this check). A shape-propagation sketch follows after this list.
2. Modify the backward to fix an issue there. The backward is not a custom op - it's a custom manual backward implementation. In particular, some situations don't support double backward, and the check for whether double backward is allowed requires a .item() call. To fix the meta/fake tensor case, this PR only sets the double-backward error when `GradMode::is_enabled()` - which shouldn't be turned on in PT2.
3. Update skips.
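
A shape-propagation sketch under the new Python meta registration:
```
import torch

self_ = torch.zeros(5, 3, device="meta")
index = torch.empty(4, dtype=torch.int64, device="meta")
src = torch.empty(4, 3, device="meta")
print(self_.index_reduce(0, index, src, "prod").shape)  # torch.Size([5, 3])
```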

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106429
Approved by: https://github.com/zou3519
2023-08-10 18:14:00 +00:00
Masaki Kozuki
2b560d3c3a Add _foreach_clamp (#106574)
Rel:
- #106221
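
A usage sketch; the `min`/`max` keyword names are an assumption mirroring `torch.clamp`:
```
import torch

xs = [torch.randn(4), torch.randn(6)]
ys = torch._foreach_clamp(xs, min=-0.5, max=0.5)  # clamps every tensor in the list
```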

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106574
Approved by: https://github.com/janeyx99
2023-08-10 05:26:09 +00:00
angelayi
7f9d1cacca [export] Minor fixes to contrain_as_size (#106737)
Fixed some minor issues with the constraint APIs while I was helping enable another model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106737
Approved by: https://github.com/tugsbayasgalan
2023-08-10 00:13:08 +00:00
Nikita Karetnikov
467a2e63f0 [pt2] add Python meta for triangular_solve (#106682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106682
Approved by: https://github.com/ezyang
2023-08-09 18:50:54 +00:00
Nikita Karetnikov
7215007f01 [pt2] add Python meta for polygamma (#106681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106681
Approved by: https://github.com/ezyang
2023-08-07 00:59:14 +00:00
Nikita Karetnikov
f694bcc9a8 [pt2] add meta for _cdist_backward (#106680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106680
Approved by: https://github.com/Skylion007
2023-08-07 00:58:14 +00:00
Nikita Karetnikov
19621a73c0 [pt2] add metas for grid_sampler_3d ops (#106261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106261
Approved by: https://github.com/ezyang
2023-08-05 14:48:11 +00:00
Nikita Karetnikov
bd34f85fe5 [pt2] meta for searchsorted.Scalar, tests, and out support (#106283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106283
Approved by: https://github.com/ezyang
2023-08-05 09:12:29 +00:00
bobby-palmer
3e6da46aff err on dot product for tensors of different sizes (#106572)
Fixes #106448
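
A sketch of the newly rejected case; the exact error type and text are assumptions:
```
import torch

try:
    torch.dot(torch.randn(3, device="meta"), torch.randn(4, device="meta"))
except RuntimeError as e:
    print(e)  # 1-D size mismatch now raises, matching eager
```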

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106572
Approved by: https://github.com/ezyang
2023-08-04 18:34:34 +00:00
Nikita Karetnikov
1f734e03df [pt2] add metas for mode ops (#106273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106273
Approved by: https://github.com/ezyang
ghstack dependencies: #106272
2023-08-03 13:11:10 +00:00
Nikita Karetnikov
70469e6f04 [pt2] add metas for median ops (#106272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106272
Approved by: https://github.com/ezyang
2023-08-03 13:11:10 +00:00
drisspg
f533791cd0 [SDPA] Mirror c++ implementation in FlashAttention meta func (#106477)
# Summary
Test an edge case and update the meta function to match the C++ implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106477
Approved by: https://github.com/eellison
2023-08-03 00:28:27 +00:00