Commit Graph

50 Commits

Author SHA1 Message Date
PyTorch MergeBot
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
Mark Saroufim
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config and all off by default for now (a minimal gating sketch follows):
1. Dynamo: macro-level config covering Dynamo, AOT, and Inductor
2. FX: a progress bar for each pass, labeled with the pass name
3. Inductor
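A minimal sketch of the gating idea (the flag and function names here are illustrative, not the PR's actual config):

```
from tqdm import tqdm

# Hypothetical flag standing in for one of the per-phase configs (off by default).
show_pass_progress = False

def run_fx_passes(passes, graph_module):
    # Only wrap the pass list in a progress bar when the config is enabled,
    # so default runs stay completely silent.
    iterator = tqdm(passes, desc="FX passes") if show_pass_progress else passes
    for fx_pass in iterator:
        graph_module = fx_pass(graph_module)
    return graph_module
```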

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
PyTorch MergeBot
22a249e44e Revert "[Inductor] More robust stride and offset extraction from index expressions (#90184)"
This reverts commit 71f27f7688.

Reverted https://github.com/pytorch/pytorch/pull/90184 on behalf of https://github.com/ngimel due to catastrophically regresses performance
2022-12-08 05:04:15 +00:00
Peter Bell
71f27f7688 [Inductor] More robust stride and offset extraction from index expressions (#90184)
Currently, the stride and offset are determined by substituting 1 and 0 for the different indices, which fails for any expression that doesn't match the expected stride calculation. Instead, this uses `sympy.match` and returns `None` for any variables used in non-standard index calculations, e.g. `torch.roll`, which uses `ModularIndexing`.
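A small standalone sketch of the `sympy.match` approach (simplified, not the PR's actual helper):

```
import sympy

x = sympy.Symbol("x")
stride = sympy.Wild("stride", exclude=[x])
offset = sympy.Wild("offset", exclude=[x])

# A standard affine index: stride and offset are recovered directly.
print((32 * x + 7).match(stride * x + offset))   # {stride_: 32, offset_: 7}

# A non-affine index (the kind ModularIndexing produces) does not match,
# so the caller can return None instead of a bogus stride.
print(((x % 8) * 4).match(stride * x + offset))  # None
```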

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90184
Approved by: https://github.com/jansel
2022-12-07 01:43:21 +00:00
XiaobingSuper
2597d5d722 TorchDynamo: always convert flexiblelayout to be FixedLayout when given a stride_order (#89904)
For convolution, we always call **require_stride_order** to convert the input to the target stride order. If the original input's layout is FlexibleLayout, there is always a memory copy, because **is_stride_order_storage_and_layout** only checks the initial stride order. Since a FlexibleLayout means the layout can still be changed, when the user gives a stride order we should always convert the FlexibleLayout to a FixedLayout using the given stride order.
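To illustrate what fixing a layout to a stride order means, here is a minimal sketch (a hypothetical helper, not Inductor's API) that turns sizes plus a stride order into concrete strides; for the (128, 3, 3, 3) pooling output in the example below it yields exactly the (27, 1, 9, 3) strides seen in the generated code:

```
def strides_for_order(sizes, stride_order):
    # stride_order[d] is the rank of dim d: rank 0 is the innermost dim (stride 1).
    strides = [0] * len(sizes)
    stride = 1
    for d in sorted(range(len(sizes)), key=lambda d: stride_order[d]):
        strides[d] = stride
        stride *= sizes[d]
    return strides

# Channels-last-like order for a (128, 3, 3, 3) buffer.
print(strides_for_order([128, 3, 3, 3], [3, 0, 2, 1]))  # [27, 1, 9, 3]
```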

In a CV use case where the max-pooling output is consumed by two convolutions, there are two memory copies:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<9; i2+=1)
            {
                {
                    {
                        auto tmp0 = out_ptr0[i1 + (3*i2) + (27*i0)];
                        out_ptr1[i1 + (3*i2) + (27*i0)] = tmp0;
                        out_ptr2[i1 + (3*i2) + (27*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf2 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf4 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg4_1
    del buf0
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf2, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    del buf2
    buf5 = torch.ops.mkldnn._convolution_pointwise(buf4, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf5, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf3, buf5, )
```

After this PR, the generated code no longer contains the redundant memory copies:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg4_1
    buf2 = torch.ops.mkldnn._convolution_pointwise(buf0, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf2, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf0, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf2, buf3, )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89904
Approved by: https://github.com/jansel
2022-12-06 03:07:53 +00:00
Natalia Gimelshein
1ea20cdb33 workaround for indexing formulas with negative terms (#89933)
Fixes https://github.com/pytorch/torchdynamo/issues/1928
For `ModularIndexing` we generate indexing code with `//` and `%` operators. When the `ModularIndexing` base is negative (which can happen after valid simplifications), `//` in Triton produces wrong results (https://github.com/openai/triton/issues/619/). For the `//` op coming from PyTorch we have codegen workarounds, but I'm reluctant to apply those workarounds to very common indexing computation patterns, for both code readability and perf considerations.
Similarly, we replace `ModularIndexing` with `IndexingDiv` when we can prove that the base is small, but those assumptions break when the `ModularIndexing` base is negative (`ModularIndexing` is always positive, `IndexingDiv` isn't).
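A quick illustration of the failure mode (plain Python, not Triton codegen; `ModularIndexing(x, q, r)` is treated here as roughly `(x // q) % r`):

```
def floor_div(a, b):
    return a // b          # Python //: floors toward negative infinity

def trunc_div(a, b):
    return int(a / b)      # C-style integer division: truncates toward zero

base, q, r = -7, 4, 8
print(floor_div(base, q) % r)  # 6 -- the intended ModularIndexing result
print(trunc_div(base, q) % r)  # 7 -- what a truncating // would produce
```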

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89933
Approved by: https://github.com/jansel
2022-12-05 19:12:29 +00:00
Jean Schmidt
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit's internal land and merged PR were inconsistent. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
PyTorch MergeBot
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to landed internal commit does not match with this one, causing merge conflict and preventing import and land new commits
2022-12-02 09:57:31 +00:00
Eli Uriegas
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm (015b05af18/torch/hub.py (L26)), which doesn't support the 'desc' kwarg and is not iterable.

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
Animesh Jain
68805b08d1 [benchmarks][dynamo] Trying CI - Set train() for TIMM models accuracy tests (#89780)
Moving to train mode for TIMM models and also raising batch size for accuracy testing.

Raising batch size seems to remove a lot of noise/instability coming from batch_norm decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780
Approved by: https://github.com/ngimel
2022-11-30 12:57:35 +00:00
Mark Saroufim
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config and all off by default for now:
1. Dynamo: macro-level config covering Dynamo, AOT, and Inductor
2. FX: a progress bar for each pass, labeled with the pass name
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
Elias Ellison
1a33b7cbfa Make fake tensors preserve dense strides in type conversion (#89803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89803
Approved by: https://github.com/ngimel
2022-11-30 01:28:51 +00:00
XiaobingSuper
0c4f3db7bf TorchDynamo: weight prepack for mkl linear (#89109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89109
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:20:19 +00:00
XiaobingSuper
07151a6bd6 TorchDynamo: weight prepack for onednn convolution external call (#88988)
This PR enables weight prepacking using MKLDNN tensors:
1. Enable fake tensor mode for MKLDNN tensor inputs.
2. Make the convolution fusion kernels support MKLDNN tensor inputs.
3. Do the weight prepack at the FX fusion step.

For better performance, we always use channels_last for the CPU convolution path, because our tests show the channels_last path performs better than the blocked-input path and it avoids the activation's layout conversions (plain to block, block to plain); currently only a plain-to-plain format conversion is needed.
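As a plain eager-mode illustration of the channels_last path mentioned above (not the MKLDNN prepack itself):

```
import torch

# Put both the weight and the activation in channels_last so the CPU conv
# runs on the channels_last path and its output stays channels_last,
# avoiding plain<->blocked layout conversions around the call.
conv = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.randn(8, 16, 32, 32).to(memory_format=torch.channels_last)

with torch.no_grad():
    y = conv(x)

print(y.is_contiguous(memory_format=torch.channels_last))  # True
```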

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88988
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:16:11 +00:00
Elias Ellison
72110d7833 Fix Upsample Decomp Striding For Small Channels (#89528)
Fix for https://github.com/pytorch/torchdynamo/issues/623.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89528
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2022-11-23 20:47:39 +00:00
Natalia Gimelshein
a188f05e8c Reland #89031 Added conv constraint that infers layouts (#89530)
Relands #89031
Per title. We now set strides from the FX graph only for convolutions and mm, which is a hack, but bmm in some cases caused an extra copy and there is no obvious way to fix that; we should rethink the strides anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
2022-11-23 20:18:54 +00:00
Brian Hirsh
57353c9608 first draft of input mutation handling for aot autograd (#88817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88817
Approved by: https://github.com/ezyang, https://github.com/wconstab
2022-11-23 19:20:11 +00:00
Animesh Jain
120d200620 Revert "Added conv constraint that infers layouts (#89031)" (#89451)
This reverts commit 716f70f19a.

Fixes performance regression and compilation latency increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-22 02:20:50 +00:00
Horace He
716f70f19a Added conv constraint that infers layouts (#89031)
The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts.

So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```

In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.

However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done).

This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.

The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
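A rough sketch of the constraint idea (hypothetical helper, not the PR's actual code): record the memory format the conv input had in eager mode and force the compiled graph's input into that same format before calling the convolution.

```
import torch

def constrain_to_eager_layout(graph_input: torch.Tensor, eager_input: torch.Tensor) -> torch.Tensor:
    # Match whatever layout eager mode used, so aten.convolution never has to
    # transpose NCHW<->NHWC internally (copies that Inductor cannot fuse away).
    if eager_input.is_contiguous(memory_format=torch.channels_last):
        return graph_input.contiguous(memory_format=torch.channels_last)
    return graph_input.contiguous()
```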

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-17 01:52:35 +00:00
Elias Ellison
73d71ae3d6 [WIP] Unwrap View in Reinterpret View (#89016)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89016
Approved by: https://github.com/ngimel
2022-11-15 04:40:13 +00:00
XiaobingSuper
7a37bbed15 Take input striding for conv fusion op based on eager output (#88864)
As in https://github.com/pytorch/pytorch/pull/88706, we also change the input stride check to use the eager output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88864
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-15 00:55:07 +00:00
XiaobingSuper
15ef0660c5 Fake Tensor For (ConvFusion) Propagation (#88414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88414
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 12:35:09 +00:00
XiaobingSuper
4ad7b17fab TorchDynamo: Add convolution binary(inplace) fusion for cpu in inference mode (#88403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88403
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 08:42:40 +00:00
Elias Ellison
8ff2e34ca6 Take input striding for conv forward based on eager output (#88706)
From discussion with @Chillee and @ngimel, we'll likely need further fixes to ensure that we hit channels-last kernels, but this is still worth landing in its own right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88706
Approved by: https://github.com/ngimel
2022-11-11 17:29:15 +00:00
Nikolay Korovaiko
c961e45ee5 handle zero dims in reductions (#88280)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88280
Approved by: https://github.com/ngimel
2022-11-11 01:13:57 +00:00
Michael Lazos
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin FX nodes through inlining during lowering
- Concatenates op names into the kernel name (see the sketch below)
- Adds a config to cap the number of ops in the kernel name so names don't get too long

Caveats:
- The ordering in the name may not match the order in which the ops are executed in the kernel
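A sketch of the naming scheme (the cap, prefix, and function name are illustrative values, not the PR's defaults):

```
# Build a kernel name from the origin op names of everything fused in,
# keeping at most `max_ops` of them so names stay readable.
def fused_kernel_name(origin_op_names, kernel_index=0, max_ops=4):
    parts = ["triton"] + list(origin_op_names)[:max_ops] + [str(kernel_index)]
    return "_".join(parts)

print(fused_kernel_name(["convolution", "relu", "add"]))  # triton_convolution_relu_add_0
```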

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
XiaobingSuper
3e43ff2794 torchdynamo: add convolution add(relu) inplace fusion kernel (#88048)
This PR adds a convolution add(relu) in-place fusion kernel, which works for **other.add_(conv)**.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88048
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-10 13:54:37 +00:00
Elias Ellison
2381548071 add stride constraints to fallbacks (#88534)
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.

Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
2022-11-10 01:13:44 +00:00
Bin Bao
f11f0e4a03 [inductor] Handle nested tuple/list output in fallback kernel (#88495)
Summary: Currently the fallback kernel in Inductor assumes its output is
either a tensor or a tuple/list of tensors. This PR makes it handle more
generic output data structures.
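An illustrative sketch of handling arbitrarily nested outputs (simplified, not Inductor's actual code):

```
import torch

def map_nested_output(out, wrap):
    # Recurse through tuples/lists, apply `wrap` to tensor leaves,
    # and pass anything else (None, ints, ...) through unchanged.
    if isinstance(out, (list, tuple)):
        return type(out)(map_nested_output(o, wrap) for o in out)
    if isinstance(out, torch.Tensor):
        return wrap(out)
    return out

nested = (torch.randn(2), [torch.randn(3), (None, torch.randn(4))])
doubled = map_nested_output(nested, lambda t: t * 2)
```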

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495
Approved by: https://github.com/jansel
2022-11-09 15:50:45 +00:00
Bin Bao
955cbe610b [inductor] Handle the case where kwargs contains tensor (#88417)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88417
Approved by: https://github.com/ngimel
2022-11-04 20:29:03 +00:00
XiaobingSuper
71f793d312 TorchDynamo: Add linear binary fusion for cpu in BF16 inference mode (#87066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87066
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 02:40:29 +00:00
Elias Ellison
7d95b1e344 Run all fallback kernels with FakeTensor (#88248)
This improves the memory compression of resnet18 from .84 -> .94 on inductor no-cudagraphs. It does mean that any extern kernel which incorrectly computes strides will be a hard error at runtime, but that's an issue we are going to have to face with dynamic shapes anyway. CC @ezyang, @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88248
Approved by: https://github.com/ezyang
2022-11-04 02:06:38 +00:00
XiaobingSuper
e4efea4f14 TorchDynamo: Add linear unary fusion for cpu in BF16 inference mode (#87065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87065
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:26:08 +00:00
XiaobingSuper
52173188ef TorchDynamo: Add convolution binary fusion for cpu in inference mode (#87064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87064
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:10:05 +00:00
Natalia Gimelshein
b4fcfe77b2 reduce the number of autotuning iterations, don't autotune simple tiled copies (#88386)

Partially fixes https://github.com/pytorch/torchdynamo/issues/1807, reduces compile time for me from 360 s to 90s.

Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386
Approved by: https://github.com/jansel
2022-11-03 15:58:18 +00:00
PyTorch MergeBot
a8561c4571 Revert "[inductor] Handle the case where kwargs contains tensor (#88215)"
This reverts commit 983c0e7f31.

Reverted https://github.com/pytorch/pytorch/pull/88215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it breaks trunk https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613987333 with a failure in test_torchinductor_opinfo.py
2022-11-02 23:33:15 +00:00
Bin Bao
983c0e7f31 [inductor] Handle the case where kwargs contains tensor (#88215)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88215
Approved by: https://github.com/ngimel
2022-11-02 19:50:16 +00:00
Elias Ellison
e6ea0a4a4b Don't Require contiguous For Extern Kernels (#87650)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87650
Approved by: https://github.com/desertfire
2022-11-01 20:20:42 +00:00
Elias Ellison
9835413009 Fake Tensor For (Conv) Propagation (#87641)
Resubmitting https://github.com/pytorch/pytorch/pull/87302 so it can be ghstack'd with the pr below.

Incorrect strides in any meta impl would lead to runtime assertion errors for fallback kernels, so start by just enabling it for conv.

Replaces https://github.com/pytorch/pytorch/pull/87588.

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87641
Approved by: https://github.com/jansel
2022-10-29 04:14:01 +00:00
XiaobingSuper
c36db82e12 TorchDynamo: Add convolution unary fusion for cpu in inference mode (#87063)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87063
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-27 06:55:32 +00:00
Natalia Gimelshein
59aacc40ca Couple fixes for argmax/argmin (#87758)
Removes a wrong assert and makes the minimum number of warps 2 (1 for some reason generates invalid code, https://github.com/openai/triton/issues/802).
Hopefully fixes https://github.com/pytorch/torchdynamo/issues/1743, cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @mreso

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87758
Approved by: https://github.com/Chillee, https://github.com/soumith
2022-10-26 06:33:43 +00:00
Animesh Jain
ebe5aad466 [inductor] Revert channels-last support (#87588)
We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation. But, after git bisect, I found the source of extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049

For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs to be channels-last. I am not sure why it caused such a large compilation-time jump, or why it did not fail; there was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to identify what could be the source of this compilation-time increase, so that we can manually check that part of the stack.
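One quick way to see how a 1x1 conv can be misclassified (an illustration, not the PR's actual check): with size-1 spatial dims, a plain contiguous weight also passes the channels_last contiguity test, so a stride-based check cannot tell the formats apart.

```
import torch

w = torch.randn(64, 32, 1, 1)  # (out_channels, in_channels, kH, kW) for a 1x1 kernel
print(w.stride())                                          # (32, 1, 1, 1)
print(w.is_contiguous())                                   # True
print(w.is_contiguous(memory_format=torch.channels_last))  # also True
```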

With this, `res2next50` compilation time went back to 96 seconds for a single thread (it had been raised to 900 seconds by my earlier PR), and parallel compilation brings it down to ~30 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel
2022-10-25 19:58:25 +00:00
Animesh Jain
c4fecff97d [inductor] Prevent aggressive fusion during inductor lowering (#87447)
Fixes https://github.com/pytorch/torchdynamo/issues/1599

Inductor performs aggressive fusion of ops during the lowering of the FX graph into IR nodes. Note that this fusion is different from the fusion we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (well after lowering). This PR instead ensures that we don't accumulate too many ops in an IR node to begin with.

In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.

Note that this could affect performance, though I doubt it will lead to a really large dip. In my toy examples, even if the lowering creates multiple IR nodes, later fusion still creates one node if it's a simple fusion.

I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.

@ngimel @jansel @Chillee

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
2022-10-24 21:53:17 +00:00
Soumith Chintala
7caeac1718 [inductor] Fix channels_last conv2d propagation when CuDNN is not found (#87266)
Fixes https://github.com/pytorch/torchdynamo/issues/1701

cc @jansel @lezcano @fdrocha @mlazos @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87266
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/voznesenskym
2022-10-21 06:36:16 +00:00
Horace He
68e946b0c3 Fixed tune_layout to not do anything for non-2d convolutions (#87328)
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87328
Approved by: https://github.com/ngimel
2022-10-20 18:02:51 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Zachary DeVito
d36c284d14 [triton] allow cuda properties to be queried from workers (#87101)
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.

Actually attempting to get CUDA to load in the workers is probably not desired: CUDA initialization takes O(seconds), and having multiple processes using the same device will slow things down.

This just moves the needed properties from the main trainer process to the workers.
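A minimal sketch of the approach (structure and names are illustrative): query the properties once in the parent process and ship plain data to the workers, so they never touch the CUDA runtime.

```
import multiprocessing as mp
import torch

def worker(device_info):
    # Uses only the pre-collected data; no torch.cuda call, so no CUDA init cost here.
    return "compiling for {name} (sm_{major}{minor})".format(**device_info)

def main():
    props = torch.cuda.get_device_properties(0)  # runs once, in the parent process
    device_info = {"name": props.name, "major": props.major, "minor": props.minor}
    with mp.Pool(processes=2) as pool:
        print(pool.map(worker, [device_info] * 2))

if __name__ == "__main__":
    main()
```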

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
2022-10-18 04:48:29 +00:00
Animesh Jain
2b558138cf [inductor] Set correct strides in fallback example run (#87049)
Helps in resolving many issues seen in https://github.com/pytorch/torchdynamo/issues/1675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87049
Approved by: https://github.com/jansel
2022-10-17 15:43:53 +00:00
Jason Ansel
30f6f6903c [inductor] Move size asserts to C++, fix bug (#87028)
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).

This caused a bug in our generated stride asserts in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride for a size==1 dimension.

This fixes that bug and moves the size/stride assert logic to C++, which should be a small perf gain.
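To illustrate the size==1 modeling mentioned above, a tiny sympy example (not Inductor's code): once a size-1 dimension's stride is modeled as 0, its term disappears from the indexing formula, so no single stride value can be recovered for that dimension from the expression alone.

```
import sympy

i0, i1 = sympy.symbols("i0 i1")

# i1 indexes a size-1 dimension, so its stride is modeled as 0.
index = 7 * i0 + 0 * i1
print(index)  # 7*i0 -- sympy drops the zero-stride term entirely
```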
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
2022-10-16 20:17:22 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`
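For example, imports change accordingly:

```
# Old (standalone repo)  ->  new (in PyTorch core)
import torch._dynamo     # was: import torchdynamo
import torch._inductor   # was: import torchinductor
```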

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00