Commit Graph

49 Commits

Author SHA1 Message Date
Bin Bao
282dfe8ba4 [inductor][Reland] Use decomposition for _to_copy (#90494)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90494
Approved by: https://github.com/ngimel
2022-12-09 16:51:50 +00:00
PyTorch MergeBot
e89685b0b5 Revert "[inductor] Use decomposition for _to_copy (#90314)"
This reverts commit 3fdb5f2dda.

Reverted https://github.com/pytorch/pytorch/pull/90314 on behalf of https://github.com/desertfire due to regresses performance on hf_Bert
2022-12-08 18:29:06 +00:00
Bin Bao
d2ee94231e [inductor] Fallback for index with None in the middle of indices (#90022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90022
Approved by: https://github.com/ngimel
2022-12-08 16:18:57 +00:00
Bin Bao
3fdb5f2dda [inductor] Use decomposition for _to_copy (#90314)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90314
Approved by: https://github.com/ngimel
2022-12-08 15:25:44 +00:00
Bin Bao
d7c30e11c6 [inductor] Remove .to from lowering (#90280)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90280
Approved by: https://github.com/ngimel
2022-12-08 00:40:41 +00:00
Peter Bell
e6a7278753 Give std/var correction overloads proper defaults (#56398)
The correction overloads' defaults were left off for forward-compatibility
reasons, but that FC window expired well over a year ago.
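
A quick sketch of what the defaulted overloads look like from Python (minimal and illustrative; the default of `correction=1` matching the classic unbiased behaviour is the documented semantics, everything else here is just an example):

```
import torch

x = torch.randn(4, 8)

# Omitting `correction` now behaves like the unbiased (Bessel-corrected)
# reduction, i.e. correction=1.
assert torch.allclose(torch.var(x, dim=1), torch.var(x, dim=1, correction=1))
assert torch.allclose(torch.std(x, dim=1), torch.std(x, dim=1, correction=1))

# correction=0 still gives the population statistic.
population_var = torch.var(x, dim=1, correction=0)
```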

Differential Revision: [D29625593](https://our.internmc.facebook.com/intern/diff/D29625593)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56398
Approved by: https://github.com/mruberry
2022-12-07 15:15:00 +00:00
Peter Bell
4f44877983 [Inductor] Add test for Scheduler fusions (#90014)
Currently there is `test_vertical_fusion1`, which fuses entirely during
the lowering stage and realizes no buffers. This adds
`test_scheduler_vertical_fusion1`, the same test but with several
intermediate calculations realized, so the scheduler is left to do the
fusion.

To support the test, this PR also adds:
- `metrics.ir_nodes_pre_fusion`, which, compared with
`generated_kernel_count`, tells us how many nodes were fused.
- `torch._test_inductor_realize`, an identity operator in eager that,
under inductor, also forces its input to be realized.
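
A rough sketch of how a test might exercise these hooks; only the names above come from this PR, while the module path for the counters and the exact call syntax are assumptions:

```
import torch
from torch._inductor import metrics  # assumed location of the counters

def f(a, b):
    # Force the intermediate to be realized so fusion is left to the scheduler
    # rather than happening implicitly during lowering (assumed call syntax).
    tmp = torch._test_inductor_realize(a + b)
    return tmp * 2 + 1

torch.compile(f)(torch.randn(64), torch.randn(64))

# Comparing the two counters tells us how many IR nodes the scheduler fused.
fused_nodes = metrics.ir_nodes_pre_fusion - metrics.generated_kernel_count
```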

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90014
Approved by: https://github.com/jansel
2022-12-07 01:33:25 +00:00
Elias Ellison
6addc8d923 [Inductor] add expm1 lowering (#89961)
Improves perf of inductor no-cudagraphs on nvidia-deeprecommender from 0.88 -> 0.96. I am looking into disabling implicit fallbacks for benchmark models in another PR.
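
A minimal example of the kind of pointwise op this lowering targets (CUDA device assumed; purely illustrative):

```
import torch

@torch.compile
def f(x):
    # With the lowering in place, expm1 becomes a fused pointwise op instead of
    # triggering an implicit fallback to eager.
    return torch.expm1(x) * 0.5

f(torch.randn(1024, device="cuda"))
```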

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89961
Approved by: https://github.com/ngimel
2022-12-02 04:29:54 +00:00
XiaobingSuper
42f27c322b TorchDynamo: don't compute index for max_pooling when return_index is false (#89838)
For max_pooling, if return_index is **False**, we don't need to compute the index.
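
For reference, a call along these lines produces the kernels below; the shapes are inferred from the generated loops and should be treated as an assumption:

```
import torch
import torch.nn.functional as F

@torch.compile
def pool(x):
    # return_indices defaults to False, so only the pooled values are needed and
    # the generated CPU kernel can drop the index bookkeeping shown in "Before".
    return F.max_pool2d(x, kernel_size=3, stride=2)

pool(torch.randn(128, 3, 7, 7))
```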

Before:

```
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp12 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp17 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp22 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp27 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp32 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp37 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = static_cast<long>((2*i2) + (14*i1));
                            auto tmp3 = static_cast<long>(1 + (2*i2) + (14*i1));
                            auto tmp4 = tmp2 > tmp0;
                            auto tmp5 = tmp4 ? tmp3 : tmp1;
                            auto tmp6 = (tmp0 != tmp0) ? tmp0 : std::max(tmp2, tmp0);
                            auto tmp8 = static_cast<long>(2 + (2*i2) + (14*i1));
                            auto tmp9 = tmp7 > tmp6;
                            auto tmp10 = tmp9 ? tmp8 : tmp5;
                            auto tmp11 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp13 = static_cast<long>(7 + (2*i2) + (14*i1));
                            auto tmp14 = tmp12 > tmp11;
                            auto tmp15 = tmp14 ? tmp13 : tmp10;
                            auto tmp16 = (tmp11 != tmp11) ? tmp11 : std::max(tmp12, tmp11);
                            auto tmp18 = static_cast<long>(8 + (2*i2) + (14*i1));
                            auto tmp19 = tmp17 > tmp16;
                            auto tmp20 = tmp19 ? tmp18 : tmp15;
                            auto tmp21 = (tmp16 != tmp16) ? tmp16 : std::max(tmp17, tmp16);
                            auto tmp23 = static_cast<long>(9 + (2*i2) + (14*i1));
                            auto tmp24 = tmp22 > tmp21;
                            auto tmp25 = tmp24 ? tmp23 : tmp20;
                            auto tmp26 = (tmp21 != tmp21) ? tmp21 : std::max(tmp22, tmp21);
                            auto tmp28 = static_cast<long>(14 + (2*i2) + (14*i1));
                            auto tmp29 = tmp27 > tmp26;
                            auto tmp30 = tmp29 ? tmp28 : tmp25;
                            auto tmp31 = (tmp26 != tmp26) ? tmp26 : std::max(tmp27, tmp26);
                            auto tmp33 = static_cast<long>(15 + (2*i2) + (14*i1));
                            auto tmp34 = tmp32 > tmp31;
                            auto tmp35 = tmp34 ? tmp33 : tmp30;
                            auto tmp36 = (tmp31 != tmp31) ? tmp31 : std::max(tmp32, tmp31);
                            auto tmp38 = static_cast<long>(16 + (2*i2) + (14*i1));
                            auto tmp39 = tmp37 > tmp36;
                            auto tmp40 = tmp39 ? tmp38 : tmp35;
                            auto tmp41 = (tmp36 != tmp36) ? tmp36 : std::max(tmp37, tmp36);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp41;
                        }
                    }
                }
            }
        }
    }
}
```
After:

```
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89838
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 04:15:45 +00:00
Elias Ellison
275ade6371 Enable rsqrt (#89771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89771
Approved by: https://github.com/anijain2305
2022-11-30 02:08:13 +00:00
XiaobingSuper
0c4f3db7bf TorchDynamo: weight prepack for mkl linear (#89109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89109
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:20:19 +00:00
Natalia Gimelshein
a188f05e8c Reland #89031 Added conv constraint that infers layouts (#89530)
Relands #89031
Per title. We now set strides from the FX graph only for convolutions and mm. This is a hack, but bmm in some cases caused an extra copy and there is no obvious way to fix that; we should rethink the strides anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
2022-11-23 20:18:54 +00:00
Animesh Jain
82713a1cc4 [inductor][compilation time] Fallback when kernel size for avg/max pool is large (#89448)
This reduces compilation time for yolov3 from 400 seconds to 48 seconds. yolov3 has a 13x13 max_pool2d kernel, which was generating very large Triton code.
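
For illustration, the kind of call that now takes the fallback path (the 13x13 kernel size is from the commit message; the shape, stride, and padding are assumptions based on yolov3's SPP block):

```
import torch
import torch.nn.functional as F

@torch.compile
def spp_pool(x):
    # A 13x13 window means 169 loads per output element; instead of unrolling
    # that into one enormous Triton kernel, inductor now falls back to ATen.
    return F.max_pool2d(x, kernel_size=13, stride=1, padding=6)

spp_pool(torch.randn(1, 512, 13, 13, device="cuda"))
```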

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89448
Approved by: https://github.com/ngimel
2022-11-22 02:23:24 +00:00
Animesh Jain
120d200620 Revert "Added conv constraint that infers layouts (#89031)" (#89451)
This reverts commit 716f70f19a.

Fixes a performance regression and an increase in compilation latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-22 02:20:50 +00:00
Peter Bell
c068fa900f [inductor] Misc division lowering fixes (#88603)
1. `aten.div.Tensor_mode` should allow broadcasting
2. `div` can use `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT`
3. `prims.div` on integers should be truncating division
4. Add a lowering for `true_divide`, which is an alias of `div`
5. Register a lowering for the in-place version of `div_mode`
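
A small sketch touching most of the cases above (shapes are arbitrary and the snippet is only illustrative):

```
import torch

@torch.compile
def div_examples(a, b):
    trunc = torch.div(a, b, rounding_mode="trunc")  # div.Tensor_mode, with broadcasting (1)
    true = torch.true_divide(a, b)                  # alias of div; int inputs promote to float (2, 4)
    a.div_(b, rounding_mode="floor")                # in-place div_mode lowering (5)
    return trunc, true, a

div_examples(torch.randn(4, 4), torch.randn(1, 4) + 2.0)
```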

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88603
Approved by: https://github.com/ngimel
2022-11-21 20:56:41 +00:00
Natalia Gimelshein
51e961dd7b use std/libdevice erf in inductor (#89388)
By itself, the libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips are removed, because our decomposition wasn't accurate enough for fp64, but the libdevice version is.
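
For example, an erf-based gelu-like function now keeps the fused kernel small (CUDA assumed; the snippet is only illustrative):

```
import torch

@torch.compile
def gelu_like(x):
    # erf maps to the std/libdevice implementation, so the surrounding pointwise
    # ops fuse into a kernel with far fewer instructions than the decomposition.
    return 0.5 * x * (1.0 + torch.erf(x * 0.7071067811865476))

gelu_like(torch.randn(1 << 16, device="cuda"))
```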

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
2022-11-21 00:58:03 +00:00
Horace He
716f70f19a Added conv constraint that infers layouts (#89031)
The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts.

So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```

In eager mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.

However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but in others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done).

This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.

The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
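
A minimal sketch of the pattern being constrained (CUDA and channels-last are chosen here purely for illustration):

```
import torch

conv = torch.nn.Conv2d(8, 8, 3, padding=1).cuda().to(memory_format=torch.channels_last)

@torch.compile
def f(a):
    b = torch.relu(a)   # "foo" from the example above: any pointwise op
    return conv(b)      # constrained to receive the eager (channels-last) layout

a = torch.randn(4, 8, 32, 32, device="cuda").to(memory_format=torch.channels_last)
f(a)
```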

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-17 01:52:35 +00:00
Nikolay Korovaiko
8506b305df handle scatter(Scalar) overload in inductor (#88894)
Relanding https://github.com/pytorch/pytorch/pull/88210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88894
Approved by: https://github.com/desertfire
2022-11-17 00:38:47 +00:00
XiaobingSuper
4ad7b17fab TorchDynamo: Add convolution binary(inplace) fusion for cpu in inference mode (#88403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88403
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 08:42:40 +00:00
XiaobingSuper
3e43ff2794 torchdynamo: add convolution add(relu) inplace fusion kernel (#88048)
This PR adds a convolution add(relu) in-place fusion kernel, which works for **other.add_(conv)**.
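
A sketch of the targeted pattern; whether the fused kernel is actually selected depends on the CPU backend and inference mode, so treat the snippet as illustrative only:

```
import torch

conv = torch.nn.Conv2d(16, 16, 3, padding=1)

@torch.compile
def f(x, other):
    # The accumulation other.add_(conv(x)), optionally followed by an in-place
    # relu, is the pattern the fusion kernel is written for.
    other.add_(conv(x))
    return other.relu_()

with torch.no_grad():
    f(torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8))
```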

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88048
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-10 13:54:37 +00:00
Elias Ellison
2381548071 add stride constraints to fallbacks (#88534)
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.

Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
2022-11-10 01:13:44 +00:00
Bin Bao
f11f0e4a03 [inductor] Handle nested tuple/list output in fallback kernel (#88495)
Summary: Currently, the fallback kernel in inductor assumes its output is
either a tensor or a tuple/list of tensors. This PR makes it handle more
generic output data structures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495
Approved by: https://github.com/jansel
2022-11-09 15:50:45 +00:00
Fabio Rocha
652af5ec15 upsample_*.vec ops are now CompositeImplicit (#85638)
They were previously CompositeExplicit, but that was not really necessary.
See discussion in https://github.com/pytorch/pytorch/issues/85405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85638
Approved by: https://github.com/ezyang, https://github.com/lezcano, https://github.com/malfet, https://github.com/jansel
2022-11-09 09:58:04 +00:00
Peter Bell
8e2627d42f [inductor] Fix aten.fmod lowering (#88602)
Currently the lowering for aten.fmod promotes integral types to float and calls
`tl.libdevice.fmod`, whereas the ATen behavior is to use the modulo operator.
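
The behavioural difference for integer inputs, which the fix aligns with ATen (a small illustrative check):

```
import torch

a = torch.tensor([7, -7, 5])
b = torch.tensor([3, 3, -2])

eager = torch.fmod(a, b)  # tensor([ 1, -1,  1]); stays integral, truncated-division remainder
compiled = torch.compile(lambda x, y: torch.fmod(x, y))(a, b)
assert torch.equal(eager, compiled)  # matches ATen after the fix
```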

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88602
Approved by: https://github.com/jansel
2022-11-08 20:27:36 +00:00
Natalia Gimelshein
53ca5ad347 enable scalar reduction with dim=-1 (#88628)
Tested with all samples for `sum`; this also fixes the sample errors on other reductions (amin, amax, any, all, etc.).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88628
Approved by: https://github.com/desertfire
2022-11-08 17:06:28 +00:00
PyTorch MergeBot
b00c43b310 Revert "fallback for scatter_(scalar) (#88210)"
This reverts commit 896fa8c5c9.

Reverted https://github.com/pytorch/pytorch/pull/88210 on behalf of https://github.com/suo due to this broke inductor tests, see: 896fa8c5c9
2022-11-07 22:29:56 +00:00
Nikolay Korovaiko
896fa8c5c9 fallback for scatter_(scalar) (#88210)
`scatter_reduce_` overloads can only accept a `Tensor` `src`.
`scatter_`, on the other hand, can accept a `Number` `src`, so this switches the fallback from `scatter_reduce_` to `scatter_`.
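
The case in question, as a minimal sketch:

```
import torch

@torch.compile
def f(x, index):
    # src is a Python number here, so only the scatter_ overload applies;
    # scatter_reduce_ would require a Tensor src.
    return x.scatter_(1, index, 1.0)

f(torch.zeros(2, 4), torch.tensor([[0, 2], [1, 3]]))
```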

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88210
Approved by: https://github.com/desertfire
2022-11-07 21:25:55 +00:00
Peter Bell
791d9ee253 [inductor] Add lowering for as_strided_scatter (#88379)
Ref pytorch/torchdynamo#327

The use of as_strided does require in-memory manipulation; however, this
lowering allows those memory ops to be fused with any preceding calculations,
e.g.

```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```

Before, this compiles to two kernels plus a call to `aten.as_strided_scatter`;
with this PR it compiles to just two kernels and no additional operator calls.

In theory I think this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when this was
implemented as a decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379
Approved by: https://github.com/jansel
2022-11-07 00:59:29 +00:00
Bin Bao
955cbe610b [inductor] Handle the case where kwargs contains tensor (#88417)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88417
Approved by: https://github.com/ngimel
2022-11-04 20:29:03 +00:00
XiaobingSuper
71f793d312 TorchDynamo: Add linear binary fusion for cpu in BF16 inference mode (#87066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87066
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 02:40:29 +00:00
XiaobingSuper
e4efea4f14 TorchDynamo: Add linear unary fusion for cpu in BF16 inference mode (#87065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87065
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:26:08 +00:00
XiaobingSuper
52173188ef TorchDynamo: Add convolution binary fusion for cpu in inference mode (#87064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87064
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:10:05 +00:00
Nikolay Korovaiko
002dad35f4 better error message for out= ops (#88367)
In cases where a tensor kwarg is actually `out=`, the error message should look nicer than this:
```
Traceback (most recent call last):
  File "/fsx/users/binbao/pytorch/torch/_inductor/graph.py", line 241, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/fsx/users/binbao/pytorch/torch/_inductor/lowering.py", line 168, in wrapped
    assert not any(isinstance(x, TensorBox) for x in kwargs.values())
AssertionError

```

https://github.com/pytorch/torchdynamo/issues/1798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88367
Approved by: https://github.com/desertfire
2022-11-03 16:20:14 +00:00
PyTorch MergeBot
a8561c4571 Revert "[inductor] Handle the case where kwargs contains tensor (#88215)"
This reverts commit 983c0e7f31.

Reverted https://github.com/pytorch/pytorch/pull/88215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it breaks trunk https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613987333 with a failure in test_torchinductor_opinfo.py
2022-11-02 23:33:15 +00:00
Bin Bao
983c0e7f31 [inductor] Handle the case where kwargs contains tensor (#88215)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88215
Approved by: https://github.com/ngimel
2022-11-02 19:50:16 +00:00
Natalia Gimelshein
1bc0e923bb add special case for power of 0.5 (#87912)
Workaround for https://github.com/pytorch/torchdynamo/issues/1775; calling sqrt is better in any case, but `libdevice.pow` still, for some reason, doesn't work if both arguments are scalars.
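
A minimal sketch of the rewrite's effect (CUDA assumed; only illustrative):

```
import torch

@torch.compile
def f(x):
    # An exponent of exactly 0.5 is special-cased to sqrt, sidestepping the
    # scalar/scalar libdevice.pow issue and producing a cheaper kernel.
    return torch.pow(x, 0.5)

x = torch.rand(1024, device="cuda")
assert torch.allclose(f(x), torch.sqrt(x))
```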

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @mreso, can you please check if that takes you further with diffusers?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87912
Approved by: https://github.com/desertfire
2022-10-28 16:09:25 +00:00
XiaobingSuper
c36db82e12 TorchDynamo: Add convolution unary fusion for cpu in inference mode (#87063)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87063
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-27 06:55:32 +00:00
Jason Ansel
707218f125 Reland #87025 and fix periodic tests (#87084)
- Relands #87025
- disables failing tests related to https://github.com/pytorch/torchdynamo/issues/1697
- Reverts d01eea6027

cc @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87084
Approved by: https://github.com/malfet, https://github.com/voznesenskym
2022-10-22 03:18:17 +00:00
Natalia Gimelshein
6775c3e19d fix 0d cpu tensor handling when it's the first arg (#87273)
Fixes https://github.com/pytorch/torchdynamo/issues/1681
When at least one of the pointwise args is on CUDA, set the device to CUDA. We assume that cases of true device mismatch have already been weeded out during tracing, and what we have is 0-d CPU tensor + CUDA tensor interop.
Also fix a 0-d tensor test that previously wasn't compiling with dynamo.
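
The interop case being handled, sketched minimally:

```
import torch

@torch.compile
def scale(alpha, x):
    # alpha is a 0-d CPU tensor and is the first argument; the pointwise kernel
    # should still be emitted for the CUDA device of x.
    return alpha * x

scale(torch.tensor(2.0), torch.randn(8, device="cuda"))
```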

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87273
Approved by: https://github.com/soumith, https://github.com/voznesenskym
2022-10-19 16:55:27 +00:00
XiaobingSuper
232fbd90ff [TorchDynamo]: fused bias for cpu convolution path (#87050)
For the aten.convolution CPU path, the bias can always be fused, so this PR adds a device check: if the inputs' device is CPU, we fuse it for better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87050
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-19 07:13:38 +00:00
Horace He
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
Fabio Rocha
e4285f09b9 [inductor] new way to compile f64 libdevice calls (#87189)
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)

`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants
of some functions. You can request dispatch to those for
float64 dtypes when using `register_pointwise` by setting
`use_libdevice_for_f64=True`.

Other minor changes:
- In Triton, sigmoid now codegens tl.sigmoid
- silu now comes from a decomp, not a lowering
- Some test skips are no longer necessary; they were removed or made xfails

Switching to `tl.sigmoid` has exactly the same performance.
Moving `silu` to a decomp does not change anything; the same Triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel
2022-10-18 19:13:11 +00:00
Horace He
adc7ee09dc Added upsample_nearest3d/1d lowering to inductor (#87158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87158
Approved by: https://github.com/ngimel
2022-10-18 18:27:56 +00:00
Zachary DeVito
d36c284d14 [triton] allow cuda properties to be queried from workers (#87101)
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before the fork.

Actually attempting to get CUDA to load in the workers is probably not desirable: CUDA initialization takes O(seconds), and having multiple processes using the same device will slow things down.

This just moves the needed properties from the main trainer process to the workers.
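
An illustrative sketch of the idea (not the actual implementation): query the properties once in the parent process and let workers read the cached copy:

```
import torch

# Done in the main trainer process, before any compile workers are forked.
_DEVICE_PROPS = {
    i: torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())
}

def props_for_worker(device_index: int):
    # Workers read the cached copy instead of touching the CUDA driver, avoiding
    # an O(seconds) CUDA initialization in every forked process.
    return _DEVICE_PROPS[device_index]
```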

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
2022-10-18 04:48:29 +00:00
PyTorch MergeBot
2c6167c4bb Revert "[inductor] Use decomps for unfold (#87025)"
This reverts commit 5099883f05.

Reverted https://github.com/pytorch/pytorch/pull/87025 on behalf of https://github.com/ZainRizvi due to Breaks periodic tests
2022-10-17 15:44:15 +00:00
Jason Ansel
5099883f05 [inductor] Use decomps for unfold (#87025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87025
Approved by: https://github.com/soumith
2022-10-16 17:10:33 +00:00
Jason Ansel
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
Jason Ansel
8f71e8de7e Sync changes from pytorch/torchdynamo, enable tests (#86950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86950
Approved by: https://github.com/Chillee
2022-10-14 23:08:58 +00:00
Jason Ansel
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00