Commit Graph

113 Commits

Author SHA1 Message Date
chuanqiw
94394e568e change the dynamo benchmark timeout as a parameter (#94284)
Change the dynamo benchmark timeout from hard code to a parameter with default value 1200ms, cause the hard code 1200ms timeout led some single thread mode model crashed on CPU platform. With the parameter, users can specify the timeout freely.

Fixes #94281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
2023-02-08 00:45:08 +00:00
Bin Bao
db011e11ea Skip sebotnet33ts_256 on CI (#94067)
Summary: Random failure on CI and it happens more frequently lately.
Skip for now and filed an issue at https://github.com/pytorch/pytorch/issues/94066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94067
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 14:58:54 +00:00
Edward Z. Yang
1d53123f44 Report graph breaks separately from graph count (#94143)
graph break != graph count - 1.  Suppose you have a nested
inline function call f1 to f2 to f3.  A graph break in f3
results in six graphs: f1 before, f2 before, f3 before, f3 after,
f2 after, f1 after.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94143
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
Edward Z. Yang
c1da35af5e Update dynamic benchmark skips (#94114)
Data from https://github.com/pytorch/pytorch/pull/94134

Signed-off-by: Edward Z. Yang <ezyangmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94114
Approved by: https://github.com/SherlockNoMad
2023-02-04 20:36:51 +00:00
Jason Ansel
e071d72f3c Tag dynamo backends as debug/experimental (#93878)
Hides debug/experimental backends by default.

Before:
```
torch._dynamo.list_backends()
['aot_eager', 'aot_eager_decomp_partition', 'aot_torchxla_trace_once', 'aot_torchxla_trivial', 'aot_ts', 'aot_ts_nvfuser', 'cudagraphs', 'dynamo_accuracy_minifier_backend', 'dynamo_minifier_backend', 'eager', 'inductor', 'ipex', 'nvprims_aten', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'torchxla_trace_once', 'torchxla_trivial', 'ts', 'tvm']
```

After:
```
torch._dynamo.list_backends()
['aot_ts_nvfuser', 'cudagraphs', 'inductor', 'ipex', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'tvm']
```

Fixes https://github.com/pytorch/pytorch/issues/93733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93878
Approved by: https://github.com/voznesenskym
2023-02-04 00:50:51 +00:00
Jason Ansel
0a93e6db5a Fix/refactor dynamo ipex backend (#93863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93863
Approved by: https://github.com/desertfire
2023-02-03 21:42:27 +00:00
Jason Ansel
203b2cad3e Remove fx2trt/torch2trt backends (#93822)
These backends have been broken for some time.  I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12.  If I manually
bypass that downgrade, I get import errors from inside fx2trt.  Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
2023-02-03 21:04:21 +00:00
Jason Ansel
a5ff40032d Fix/refactor dynamo onnxrt backend (#93818)
Fixes https://github.com/pytorch/pytorch/issues/90352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93818
Approved by: https://github.com/voznesenskym
2023-02-03 20:48:02 +00:00
Edward Z. Yang
2481fc0df4 Add count to FakeTensorMode.__torch_dispatch__ (#93936)
Most calls to fake tensor never hit `FakeTensor.__torch_dispatch__`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93936
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-02-03 14:21:11 +00:00
Fabio Rocha
63115b70f0 Fixed issue with --diff-branch arg in dynamo benchmarks (#93989)
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.

This fixes this by running a separate subprocess for each branch to make sure you are not being affected by run for other branch.

Also added a couple of more significant figures
to numbers in summary table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
2023-02-03 08:36:57 +00:00
Jason Ansel
60e8c766b5 Refactor dynamo training backends (#93409)
This splits training.py into many files and moves them from `dynamo.optimizations.training` to `dynamo.backends.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93409
Approved by: https://github.com/ezyang
2023-02-03 03:07:15 +00:00
atalman
6e285c479d Remove cuda 11.6 from CI replace with 11.7 (#93406)
Remove cuda 11.6 from CI replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93406
Approved by: https://github.com/malfet, https://github.com/desertfire
2023-02-02 19:16:05 +00:00
Jason Ansel
d7b39b17ab Remove torch/_dynamo/optimizations/{analysis,log_args}.py (#93279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93279
Approved by: https://github.com/voznesenskym
2023-02-02 02:34:36 +00:00
Edward Z. Yang
03b465a6d0 Add --iterations to benchmark script (#93858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93858
Approved by: https://github.com/williamwen42
2023-02-01 21:56:49 +00:00
Edward Z. Yang
08041c5264 Configurable repro_tolerance for same_two_models (#93398)
Fixes https://github.com/pytorch/pytorch/issues/93293

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93398
Approved by: https://github.com/SherlockNoMad
2023-02-01 01:41:48 +00:00
Edward Z. Yang
811e95a15e --dynamic-ci-skips now works for all backends (#93369)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93369
Approved by: https://github.com/albanD
2023-01-31 20:07:58 +00:00
Edward Z. Yang
efee879695 Don't suppress warnings in CI. (#93269)
Warnings are an important clue that something bad is going on.
You want to see them in logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93269
Approved by: https://github.com/voznesenskym
2023-01-30 19:21:09 +00:00
Edward Z. Yang
9eb402d18e Update dynamic benchmark skips (#93228)
Data from https://github.com/pytorch/pytorch/pull/93223

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93228
Approved by: https://github.com/desertfire
2023-01-30 14:22:53 +00:00
XiaobingSuper
9a2becf60a inductor: fix inplace op's wrong lowering issue when preop is NopKernel (#92247)
For TIMM ghostnet_100, there has such case, concat+inplace_add:

```
import torch
from torch._inductor import config
config.debug = True
torch._dynamo.config.verbose=True

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y, z):
        out = torch.cat([x, y], dim=1)
        out+=z
        return out

mod = MockModule().eval()
inputs = (
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 128, 16, 16]),
            )
ref = mod(*inputs)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(mod)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
print(torch.equal(ref, out))
```

the inductor always get a wrong result, I find that inductor get a wrong code:

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       const float* __restrict__ in_ptr3,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
            tmp0.store(out_ptr1 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<2048; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr2 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr3 + 16*i0);
            auto tmp2 = tmp0 + tmp1;
            tmp2.store(out_ptr2 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=32768; i0<32768; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = in_ptr3[i0];
            auto tmp2 = tmp0 + tmp1;
            out_ptr2[i0] = tmp2;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    buf3 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1))  # alias
    buf1 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1), 16384)  # alias
    buf2 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1, arg2_1]))

```
you can see that the add operation always adds a random value, see the ir code:

1. **ir_pre_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```
2. **ir_post_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store
```

From the ir code, you can see the buf3 always adds an empty buf2 which has never been written. The root cause is that there has a potential issue when doing the mutation for inplace add when its' input is a NopKernel.

After this PR, the ir will be like(**ir_pre_fusion.txt**):

```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf2']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf2']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,)), StarDep(name='buf2')]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
buf3.mutations = ['buf2']
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92247
Approved by: https://github.com/ngimel, https://github.com/desertfire, https://github.com/jansel
2023-01-29 05:35:21 +00:00
Edward Z. Yang
025ef99ddf Get rid of dedicated inductor dynamic_shapes config (#93076)
Instead, use Dynamo dynamic_shapes config

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93076
Approved by: https://github.com/voznesenskym
2023-01-27 02:58:16 +00:00
Edward Z. Yang
5e9fa0a8fc Mark crossvit_9_240 as passing dynamic=True (#92981)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92981
Approved by: https://github.com/Chillee
2023-01-26 13:05:37 +00:00
Michael Voznesensky
d322f82b05 Add @count util to torch, use it to track benchmark stats (#93013)
<img width="1333" alt="image" src="https://user-images.githubusercontent.com/4755252/214687911-f766f072-c162-4298-9aed-c889f1375336.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93013
Approved by: https://github.com/ezyang
2023-01-26 03:09:12 +00:00
Edward Z. Yang
2ee94633a1 Change ciflow/inductor to test inductor inference with dynamic shapes (#92771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92771
Approved by: https://github.com/voznesenskym
2023-01-25 02:21:02 +00:00
Edward Z. Yang
f724ecbd52 Add dynamic shapes aot_eager to periodic (#92770)
This means it overlaps with ciflow/inductor, but I'm about
to change that soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92770
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/desertfire
2023-01-25 02:21:02 +00:00
Edward Z. Yang
fb46d3e138 Run all of the timm models shards in the periodic (#92900)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92900
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2023-01-24 17:56:20 +00:00
Horace He
c0327eb463 Some more inductor fixes for symbolic shapes (#92867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92867
Approved by: https://github.com/ezyang
2023-01-24 15:05:46 +00:00
PyTorch MergeBot
2cf03bbbab Revert "Run all of the timm models shards in the periodic (#92743)"
This reverts commit de69cedf98.

Reverted https://github.com/pytorch/pytorch/pull/92743 on behalf of https://github.com/atalman due to This needs to be landed after https://github.com/pytorch/pytorch/pull/92845 and https://github.com/pytorch/pytorch/pull/92846 are landed
2023-01-23 23:44:09 +00:00
Fabio Rocha
a43b55e135 A few usability improvements for the dynamo benchmarks. (#92713)
--diff_main renamed to --diff-branch BRANCH and now works again
Summary table splits results per branch.
csv output now has column with branch name when run in this mode

Added --progress flag so you can track how many models are going to be
run.

Example output:
```
$ python benchmarks/dynamo/torchbench.py  --quiet --performance --backend inductor --float16 --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)   --filter 'alexnet|vgg16' --progress  --diff viable/strict
Running model 1/2
batch size: 1024
cuda eval  alexnet                             dynamo_bench_diff_branch   1.251x p=0.00
cuda eval  alexnet                             viable/strict              1.251x p=0.00
Running model 2/2
batch size: 128
cuda eval  vgg16                               dynamo_bench_diff_branch   1.344x p=0.00
cuda eval  vgg16                               viable/strict              1.342x p=0.00

Summary for tag=dynamo_bench_diff_branch:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.09x mean=25.26x
compilation_latency mean=2.0 seconds
compression_ratio   mean=0.9x

Summary for tag=viable/strict:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.11x mean=25.29x
compilation_latency mean=0.5 seconds
compression_ratio   mean=1.0x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92713
Approved by: https://github.com/jansel
2023-01-23 18:23:35 +00:00
Edward Z. Yang
4a3fb7bcbc Make CI_SKIPS into a consolidated dict (#92769)
This makes it easier to add more configurations without causing a
thicket of if statements selecting the correct variable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92769
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
3cfd2fa1c7 Make --inductor imply --backend inductor (#92764)
This is to make some downstream code more uniform (can always ask args.backend for backend)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92764
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
Edward Z. Yang
c52567ec18 Switch CI exclusions to use exact match. (#92761)
Since the CI exclusions are hard-coded in our script, we might as well require them to match exactly. This solved some head scratching where I was like, "this model is not obviously excluded, why is it not showing up in CI."

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92761
Approved by: https://github.com/jansel
2023-01-22 17:10:20 +00:00
Edward Z. Yang
de69cedf98 Run all of the timm models shards in the periodic (#92743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92743
Approved by: https://github.com/kit1980
2023-01-21 18:39:17 +00:00
Michael Voznesensky
5778c04a15 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-21 10:52:13 +00:00
Edward Z. Yang
27bf879b8c Forward fix: restore sebotnet33ts_256 aot_eager skip (#92741)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92741
Approved by: https://github.com/kit1980
2023-01-21 08:10:23 +00:00
Edward Z. Yang
9ad0aca6e5 Update aot_eager CI failures (#92696)
Based on https://hud.pytorch.org/pr/92689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92696
Approved by: https://github.com/desertfire
2023-01-21 02:29:22 +00:00
PyTorch MergeBot
44132cc4b0 Revert "Add --timing flag, phase timing to @dynamo_timed (#92637)"
This reverts commit 773b513435.

Reverted https://github.com/pytorch/pytorch/pull/92637 on behalf of https://github.com/malfet due to Broke lint
2023-01-20 16:23:20 +00:00
Michael Voznesensky
773b513435 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-20 05:01:21 +00:00
Edward Z. Yang
44e52ea514 Reenable mobilevit_s in CI, seems to pass (#92585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92585
Approved by: https://github.com/Chillee
2023-01-19 15:24:45 +00:00
Edward Z. Yang
b92a7afed9 Reclassify some dynamic aot_eager failures as static failures (#92376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92376
Approved by: https://github.com/Chillee
2023-01-18 19:27:11 +00:00
Wu, Chunyuan
3aa6cec18c [dynamo] exclude reset_rng_state when measure timing (#92237)
Fixes inductor performance regression on CPU: https://github.com/pytorch/torchdynamo/issues/2027, https://github.com/pytorch/torchdynamo/issues/2028 and https://github.com/pytorch/torchdynamo/issues/2029.
The details are explained here: https://github.com/pytorch/torchdynamo/issues/2028#issuecomment-1381496678.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n5000  --no-skip --dashboard --only=lennard_jones --quiet
```

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<link id=Main-File rel=Main-File
href="file:///C:/Users/chunyuan/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
<link rel=File-List
href="file:///C:/Users/chunyuan/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">

</head>

<body link="#0563C1" vlink="#954F72">

Time before   regression | Time after regression | Time with this PR
-- | -- | --
0.00020483799744397402 | 0.0002818034990923479 | 0.00020241099991835654

</body>

</html>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92237
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-18 13:17:28 +00:00
Edward Z. Yang
fbbb19599a Update dynamic skips after #92076 (#92103)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92103
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
2023-01-13 04:05:10 +00:00
Edward Z. Yang
74cbf058a5 Support --dynamic-ci-skips (#91893)
This makes it easier for us to run only the skipped benchmarks and
see if that actually started passing.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91893
Approved by: https://github.com/albanD
2023-01-11 20:02:58 +00:00
Edward Z. Yang
d24324bf1d s/INDCUTOR/INDUCTOR/ (#91885)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91885
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2023-01-11 12:28:19 +00:00
Edward Z. Yang
56ed976edf hrnet_w18, tts_angular works with dynamic shapes (#91891)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91891
Approved by: https://github.com/voznesenskym
2023-01-11 11:40:16 +00:00
blzheng
0c1777acec Dynamo benchmark: add CPU specific changes (#88477)
This pr adds some CPU specific changes:

- Add support for IPEX backend
- https://github.com/pytorch/torchdynamo/issues/1618
- https://github.com/pytorch/torchdynamo/issues/1534
- Enable CPU launcher in runner.py.
- Fix the issue that some environment variables are not support on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88477
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-07 09:26:06 +00:00
Shunting Zhang
a5f32f8978 training support for dynamo+torchxla integration (#88449)
We've already shown some promising perf result by integrating dynamo with torchxla for inference. To provide consistent UX for training and for inference, in this PR we try to enable training for dynamo/torchxla.

Training is trickier than inference and we may not expect much perf gains since
1. in training case, torchxla only generate a single combined graph for fwd/bwd/optimizer while in `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works, we will generate 3 graphs: one for forward, one for backward and one for the optimizer. XLA favors larger graph to do more optimizations.
2. in training case, tracing overhead can be overlapped with computation. Tracing overhead is not as a big deal for training as for inference. After all training cares more about throughput while inference cares more about latency.
3. in training case, people can increase batch size to 'mitigate' the tracing overhead. Increase batch size does not change tracing overhead, thus it shows like the tracing overhead 'per example' reduces.

But we still want to add training support to dynamo/torchxla to make the work complete.

We added '--iterations-per-run' argument to control how may iterations we do per measure/device sync. This is to understand the impact of item 2 above.

Results:

With '--iterations-per-run' equals to 1, here are the perf numbers:
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |             0.91   |                0.959    |
+-------------------------+--------------------+-------------------------+
| resnet50                |             0.917  |                0.932    |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |             0.912  |                0.905    |
+-------------------------+--------------------+-------------------------+
| alexnet                 |             1.038  |                0.974    |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |             0.881  |                0.835    |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |             0.903  |                0.931    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |             0.914  |                0.967    |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |             1.359  |                0.84     |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |             1.288  |                0.893    |
+-------------------------+--------------------+-------------------------+
| geomean                 |             1.0006 |                0.913794 |
+-------------------------+--------------------+-------------------------+
```

Overall it looks like graph break indeed cause perf loss. But for BERT_pytorch and timm_vision_transformer we still see perf gain. We need do more experiments with larger '--iterations-per-run'

NOTE:
In torchbench.py I added the following code to do a few workaround:
```
from myscripts import workaround # TODO will remove this line before landing
```

Here are the content of workaround.py:
```
import torch
from torch import nn
import os

# override max_pool2d with avg_pool2d
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
    torch.nn.MaxPool2d = torch.nn.AvgPool2d

```

It work around a few issues we found
1. MaxPool2d does not work for training in dynamo/torchxla: https://github.com/pytorch/torchdynamo/issues/1837 . WIP fix from Brian in https://github.com/pytorch/pytorch/pull/90226 , https://github.com/pytorch/xla/pull/4276/files (WIP)
2. recent change ( this PR https://github.com/pytorch/pytorch/pull/88697 ) in op decomposition cause batch_norm ops to fallback in torchxla. Fix from jack in https://github.com/pytorch/xla/pull/4282#event-7969608134 . (confirmed the fix after adding Deduper to handle duplicated return from fx graph generated by AOTAutograd)
3. we have issue to handle dropout because of random seed out of sync issue. Here is the fix: https://github.com/pytorch/xla/pull/4293 (confirmed the fix)

Example command:
```
REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88449
Approved by: https://github.com/wconstab, https://github.com/qihqi, https://github.com/malfet
2023-01-05 19:59:34 +00:00
Bin Bao
6bf0e3b697 [inductor] Check for BackendCompilerFailed on CI (#91634)
Summary: https://github.com/pytorch/pytorch/pull/91283/ skips certain
random triton failure on CI, but we need to check against the
BackendCompilerFailed exception type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91634
Approved by: https://github.com/ngimel
2023-01-03 22:38:29 +00:00
Animesh Jain
a32916190d buck-related minifier work (#91215)
Summary: Extending the minifier to generate buck target

Test Plan: N/A

Differential Revision: D42173893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91215
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-12-22 19:33:50 +00:00
Bin Bao
07c61685c8 [inductor] CI improvments (#91283)
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps to
eliminate the eager_variance failures seen on CI
2) Skip Triton failure instead of retry
3) Some minor script cleanup is also included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91283
Approved by: https://github.com/anijain2305
2022-12-22 15:37:43 +00:00
Michael Lazos
2f5759eaba Disable non-deterministic models for optimizers (#91149)
These two models are non-deterministic even with constant inputs + weights and sometimes fail with variations between the fp64 and fp32 models in CI very rarely as a result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91149
Approved by: https://github.com/desertfire
2022-12-20 20:19:54 +00:00