Commit Graph

139 Commits

Author SHA1 Message Date
Bin Bao
1359d16fe8 [CI] Further tighten the checking of two eager runs (#95902)
Summary: To catch nondeterminism in eager if there is any.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95902
Approved by: https://github.com/jansel
2023-03-05 14:53:02 +00:00
Edward Z. Yang
c7c4a20321 Update dynamic skips (#95966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95966
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-04 23:01:58 +00:00
Jason Ansel
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
Edward Z. Yang
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
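
For illustration only (a sketch of mine, not code from this PR), what unspecialization buys at the user level, assuming dynamic shapes are enabled:

```python
import torch

def f(x, n: int):
    # With int unspecialization, n is traced as a SymInt and participates
    # in the computation symbolically instead of being baked in as a constant.
    return x * n + n

opt_f = torch.compile(f, dynamic=True)
opt_f(torch.randn(4), 3)
opt_f(torch.randn(4), 7)  # ideally reuses the compiled graph for a new int
```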

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
Jason Ansel
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
Bin Bao
879f0c3fee [CI] Increase the timeout limit for benchmark test (#95787)
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2023-03-01 19:54:25 +00:00
Bin Bao
e79b2b7792 [CI] Force clear triton cache between running each test (#95729)
Summary: The idea is to see if this reduces some of the flakiness
we have seen on CI. If it does help, then we have a problem in our
caching implementation.
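
A rough sketch of what "force clear" can mean; the cache locations below are assumptions, not necessarily the mechanism this PR uses:

```python
import getpass
import os
import shutil

def clear_compiler_caches():
    # Assumed default locations of the triton kernel cache and the
    # inductor code cache; delete both between tests.
    for cache_dir in (
        os.path.expanduser("~/.triton/cache"),
        f"/tmp/torchinductor_{getpass.getuser()}",
    ):
        shutil.rmtree(cache_dir, ignore_errors=True)
```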

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95729
Approved by: https://github.com/ngimel
2023-03-01 04:10:03 +00:00
Will Constable
1a72712645 Add dynamo graph break stats to CI (#95635)
Adds columns with dynamo graph-break stats to the CSV produced by the accuracy job.

Example output from torchbench CI job:
<img width="771" alt="image" src="https://user-images.githubusercontent.com/4984825/221716236-9276684e-1be8-43e1-837e-f41671d4e0e3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95635
Approved by: https://github.com/ezyang
2023-02-28 16:17:46 +00:00
Edward Z. Yang
3762e801ba Update dynamic skips (#95587)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95587
Approved by: https://github.com/voznesenskym
2023-02-28 03:26:55 +00:00
Bin Bao
fa5a4b0dfc [CI] Do not compare two eager run results against fp64 result (#95616)
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide the non-determinism in eager.
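
A hedged sketch of the check being described (names and tolerances here are illustrative):

```python
import torch

def eager_is_deterministic(model, inputs, rtol=1e-4, atol=1e-4):
    # Two eager runs should agree; comparing them directly, rather than
    # each against an fp64 reference, surfaces nondeterminism early.
    out1 = model(*inputs)
    out2 = model(*inputs)
    return torch.allclose(out1, out2, rtol=rtol, atol=atol)
```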

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
2023-02-27 20:11:21 +00:00
Bin Bao
ab1ab3ab19 [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
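
For reference, these are the kinds of `torch.backends.cudnn` options typically pinned down for determinism; the exact set in this PR may differ:

```python
import torch

torch.backends.cudnn.deterministic = True  # only deterministic algorithms
torch.backends.cudnn.benchmark = False     # no autotuning between runs
torch.backends.cudnn.allow_tf32 = False    # avoid TF32 rounding variation
```
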
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95478
Approved by: https://github.com/ezyang
2023-02-25 18:54:12 +00:00
Bin Bao
4c8ad93a7c [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
Summary: hf_GPT2_large shows random failures on CI for CPU inference. Created https://github.com/pytorch/pytorch/issues/95474 for the Intel team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95473
Approved by: https://github.com/anijain2305
2023-02-24 18:21:36 +00:00
Will Constable
8de4238a31 Add dynamo bench arg --per_process_memory_fraction (#95260)
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.
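
A minimal sketch of the plumbing (the flag wiring here is an assumption):

```python
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument("--per_process_memory_fraction", type=float, default=None)
args = parser.parse_args()

if args.per_process_memory_fraction is not None:
    # Same-named torch.cuda API; caps this process's share of GPU memory,
    # which is handy for reproducing OOMs seen on smaller GPUs.
    torch.cuda.set_per_process_memory_fraction(args.per_process_memory_fraction)
```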

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95260
Approved by: https://github.com/davidberard98
2023-02-22 05:11:18 +00:00
Edward Z. Yang
08370ddad8 Update model skips (#95089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95089
Approved by: https://github.com/albanD
2023-02-20 13:24:49 +00:00
Wang, Eikan
954c767bc6 [Inductor] Enable accuracy test for CPPBackend (#94898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94898
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-20 05:02:15 +00:00
Edward Z. Yang
a2f44d82f8 Flag guard unbacked SymInt/SymFloat support (#94987)
I believe this fixes the AllenaiLongformerBase problem in periodic.

The longer version of the problem here is that we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad. So we flag-guard until we get working fallbacks.

What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then, if that results in a crash, restart Dynamo analysis with the plan of graph breaking immediately where the item() call happened.
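
For illustration (my example, not from the PR), the kind of code that trips a data-dependent guard once item() calls become unbacked:

```python
import torch

def f(x):
    n = int(x.sum())       # an item()-style call; unbacked SymInt when enabled
    return torch.zeros(n)  # any guard on n here is data-dependent and can fail
```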

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-17 00:25:05 +00:00
Edward Z. Yang
7aaebe00ee Fail dynamic_aot_eager AllenaiLongformerBase model (#94986)
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is Eq(i3, -1).  Scroll up to see where each of these data-dependent accesses originally occurred.

While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
    chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
2023-02-16 20:02:46 +00:00
Aaron Gokaslan
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.
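
The change in miniature:

```python
import logging

log = logging.getLogger(__name__)
log.warn("old spelling")     # deprecated; now emits a DeprecationWarning
log.warning("new spelling")  # what this PR switches every call site to
```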

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
Edward Z. Yang
ae7a628b03 Dynamic shapes CI updates (#94690)
Data from https://github.com/pytorch/pytorch/pull/94683

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94690
Approved by: https://github.com/cpuhrsch
2023-02-13 18:20:12 +00:00
PyTorch MergeBot
10c430ba0a Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 2a5851735a.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/desertfire due to TIMM models start to show flaky failures after this PR, need more investigation
2023-02-10 04:40:32 +00:00
Bin Bao
2a5851735a Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough to eliminate non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.
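
The switch itself, for reference:

```python
import torch

# Bypass cudnn entirely, removing it as a source of non-determinism
# in accuracy runs.
torch.backends.cudnn.enabled = False
```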

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-09 23:43:13 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. This adds `--command-arg-name` forms to the argument parsers; the old underscore spellings (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
        ...
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires the Shift (or Caps Lock) key, while `-` does not.
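
A sketch of the backward-compatible aliasing with plain argparse (the option name is borrowed from the torchrun example above):

```python
import argparse

parser = argparse.ArgumentParser()
# The dashed form is canonical; the underscore spelling stays as an alias.
parser.add_argument("--master-port", "--master_port", type=int, default=29500)

# argparse normalizes dashes to underscores in the attribute name,
# so both spellings land on args.master_port.
args = parser.parse_args(["--master_port", "1234"])
assert args.master_port == 1234
```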

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Edward Z. Yang
c028fc4e25 Decouple PT2 dynamic shapes from the functorch setting (#94469)
The functorch setting still exists, but it is no longer necessary:
we infer use of the Python dispatcher by checking whether the
ambient FakeTensorMode has a ShapeEnv.  The setting now only
controls direct AOTAutograd use; for PT2, it's sufficient to use
torch._dynamo.config.dynamic_shapes.
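
That is, the PT2-facing knob is just:

```python
import torch._dynamo

# Dynamic shapes give the ambient FakeTensorMode a ShapeEnv, which in
# turn implies use of the Python dispatcher.
torch._dynamo.config.dynamic_shapes = True
```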

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94469
Approved by: https://github.com/Chillee, https://github.com/voznesenskym, https://github.com/jansel
2023-02-09 06:41:41 +00:00
PyTorch MergeBot
ca63040d2b Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 7bfc59993d.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/huydhn due to This change fails in trunk 7bfc59993d running out of memory.  Mark this as weird because it was green in PR
2023-02-09 01:24:35 +00:00
Bin Bao
7bfc59993d Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough to eliminate non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-08 23:30:10 +00:00
Jason Ansel
eb1aca162e Re-enable cudagraphs for benchmark scripts (#94192)
Related to https://github.com/pytorch/pytorch/pull/93253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94192
Approved by: https://github.com/albanD, https://github.com/desertfire
2023-02-08 16:38:32 +00:00
chuanqiw
94394e568e change the dynamo benchmark timeout as a parameter (#94284)
Change the dynamo benchmark timeout from a hard-coded value to a parameter with a default of 1200ms, because the hard-coded 1200ms timeout caused some single-thread-mode models to crash on the CPU platform. With the parameter, users can specify the timeout freely.
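
A hedged sketch of the shape of the change (the flag name and wiring are assumptions, not the PR's actual interface):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--timeout",   # hypothetical flag name; the real one may differ
    type=int,
    default=1200,  # the previously hard-coded value
    help="benchmark timeout; raise it for slow single-thread CPU runs",
)
```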

Fixes #94281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
2023-02-08 00:45:08 +00:00
Bin Bao
db011e11ea Skip sebotnet33ts_256 on CI (#94067)
Summary: This test fails randomly on CI, and more frequently lately.
Skip it for now; an issue was filed at https://github.com/pytorch/pytorch/issues/94066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94067
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 14:58:54 +00:00
Edward Z. Yang
1d53123f44 Report graph breaks separately from graph count (#94143)
graph break != graph count - 1.  Suppose you have a nested
inline function call f1 to f2 to f3.  A graph break in f3
results in six graphs: f1 before, f2 before, f3 before, f3 after,
f2 after, f1 after.
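
An illustration (mine, not from the PR) of why graphs multiply around a break in nested, inlined calls:

```python
import torch
import torch._dynamo

def f3(x):
    x = x + 1
    torch._dynamo.graph_break()  # one break, deep inside the inline stack
    return x + 2

def f2(x):
    return f3(x * 2) * 3

def f1(x):
    return f2(x - 1) + 4

# The single break splits the traced code before/after it in every
# enclosing frame, so compiled-graph count grows faster than break count.
torch.compile(f1)(torch.randn(8))
```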

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94143
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
Edward Z. Yang
c1da35af5e Update dynamic benchmark skips (#94114)
Data from https://github.com/pytorch/pytorch/pull/94134

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94114
Approved by: https://github.com/SherlockNoMad
2023-02-04 20:36:51 +00:00
Jason Ansel
e071d72f3c Tag dynamo backends as debug/experimental (#93878)
Hides debug/experimental backends by default.

Before:
```
torch._dynamo.list_backends()
['aot_eager', 'aot_eager_decomp_partition', 'aot_torchxla_trace_once', 'aot_torchxla_trivial', 'aot_ts', 'aot_ts_nvfuser', 'cudagraphs', 'dynamo_accuracy_minifier_backend', 'dynamo_minifier_backend', 'eager', 'inductor', 'ipex', 'nvprims_aten', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'torchxla_trace_once', 'torchxla_trivial', 'ts', 'tvm']
```

After:
```
torch._dynamo.list_backends()
['aot_ts_nvfuser', 'cudagraphs', 'inductor', 'ipex', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'tvm']
```

Fixes https://github.com/pytorch/pytorch/issues/93733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93878
Approved by: https://github.com/voznesenskym
2023-02-04 00:50:51 +00:00
Jason Ansel
0a93e6db5a Fix/refactor dynamo ipex backend (#93863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93863
Approved by: https://github.com/desertfire
2023-02-03 21:42:27 +00:00
Jason Ansel
203b2cad3e Remove fx2trt/torch2trt backends (#93822)
These backends have been broken for some time.  I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12.  If I manually
bypass that downgrade, I get import errors from inside fx2trt.  Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
2023-02-03 21:04:21 +00:00
Jason Ansel
a5ff40032d Fix/refactor dynamo onnxrt backend (#93818)
Fixes https://github.com/pytorch/pytorch/issues/90352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93818
Approved by: https://github.com/voznesenskym
2023-02-03 20:48:02 +00:00
Edward Z. Yang
2481fc0df4 Add count to FakeTensorMode.__torch_dispatch__ (#93936)
Most calls to fake tensor never hit `FakeTensor.__torch_dispatch__`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93936
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-02-03 14:21:11 +00:00
Fabio Rocha
63115b70f0 Fixed issue with --diff-branch arg in dynamo benchmarks (#93989)
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.

This fixes that by running a separate subprocess for each branch, so the run for one branch cannot affect the other.

Also added a couple more significant figures to the numbers in the summary table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
2023-02-03 08:36:57 +00:00
Jason Ansel
60e8c766b5 Refactor dynamo training backends (#93409)
This splits training.py into many files and moves them from `dynamo.optimizations.training` to `dynamo.backends.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93409
Approved by: https://github.com/ezyang
2023-02-03 03:07:15 +00:00
atalman
6e285c479d Remove cuda 11.6 from CI replace with 11.7 (#93406)
Remove cuda 11.6 from CI replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93406
Approved by: https://github.com/malfet, https://github.com/desertfire
2023-02-02 19:16:05 +00:00
Jason Ansel
d7b39b17ab Remove torch/_dynamo/optimizations/{analysis,log_args}.py (#93279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93279
Approved by: https://github.com/voznesenskym
2023-02-02 02:34:36 +00:00
Edward Z. Yang
03b465a6d0 Add --iterations to benchmark script (#93858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93858
Approved by: https://github.com/williamwen42
2023-02-01 21:56:49 +00:00
Edward Z. Yang
08041c5264 Configurable repro_tolerance for same_two_models (#93398)
Fixes https://github.com/pytorch/pytorch/issues/93293

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93398
Approved by: https://github.com/SherlockNoMad
2023-02-01 01:41:48 +00:00
Edward Z. Yang
811e95a15e --dynamic-ci-skips now works for all backends (#93369)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93369
Approved by: https://github.com/albanD
2023-01-31 20:07:58 +00:00
Edward Z. Yang
efee879695 Don't suppress warnings in CI. (#93269)
Warnings are an important clue that something bad is going on.
You want to see them in logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93269
Approved by: https://github.com/voznesenskym
2023-01-30 19:21:09 +00:00
Edward Z. Yang
9eb402d18e Update dynamic benchmark skips (#93228)
Data from https://github.com/pytorch/pytorch/pull/93223

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93228
Approved by: https://github.com/desertfire
2023-01-30 14:22:53 +00:00
XiaobingSuper
9a2becf60a inductor: fix inplace op's wrong lowering issue when preop is NopKernel (#92247)
For TIMM ghostnet_100, there is such a case, concat+inplace_add:

```
import torch
from torch._inductor import config
config.debug = True
torch._dynamo.config.verbose=True

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y, z):
        out = torch.cat([x, y], dim=1)
        out+=z
        return out

mod = MockModule().eval()
inputs = (
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 128, 16, 16]),
            )
ref = mod(*inputs)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(mod)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
print(torch.equal(ref, out))
```

Inductor always gets a wrong result; I found that Inductor generates wrong code:

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       const float* __restrict__ in_ptr3,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
            tmp0.store(out_ptr1 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<2048; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr2 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr3 + 16*i0);
            auto tmp2 = tmp0 + tmp1;
            tmp2.store(out_ptr2 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=32768; i0<32768; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = in_ptr3[i0];
            auto tmp2 = tmp0 + tmp1;
            out_ptr2[i0] = tmp2;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    buf3 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1))  # alias
    buf1 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1), 16384)  # alias
    buf2 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1, arg2_1]))

```
You can see that the add operation always adds a random value; see the IR code:

1. **ir_pre_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```
2. **ir_post_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store
```

From the IR code, you can see that buf3 always adds an empty buf2 that has never been written. The root cause is a latent issue in how the mutation for an in-place add is handled when its input is a NopKernel.

After this PR, the IR will look like this (**ir_pre_fusion.txt**):

```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf2']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf2']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,)), StarDep(name='buf2')]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
buf3.mutations = ['buf2']
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92247
Approved by: https://github.com/ngimel, https://github.com/desertfire, https://github.com/jansel
2023-01-29 05:35:21 +00:00
Edward Z. Yang
025ef99ddf Get rid of dedicated inductor dynamic_shapes config (#93076)
Instead, use Dynamo dynamic_shapes config

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93076
Approved by: https://github.com/voznesenskym
2023-01-27 02:58:16 +00:00
Edward Z. Yang
5e9fa0a8fc Mark crossvit_9_240 as passing dynamic=True (#92981)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92981
Approved by: https://github.com/Chillee
2023-01-26 13:05:37 +00:00
Michael Voznesensky
d322f82b05 Add @count util to torch, use it to track benchmark stats (#93013)
<img width="1333" alt="image" src="https://user-images.githubusercontent.com/4755252/214687911-f766f072-c162-4298-9aed-c889f1375336.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93013
Approved by: https://github.com/ezyang
2023-01-26 03:09:12 +00:00
Edward Z. Yang
2ee94633a1 Change ciflow/inductor to test inductor inference with dynamic shapes (#92771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92771
Approved by: https://github.com/voznesenskym
2023-01-25 02:21:02 +00:00
Edward Z. Yang
f724ecbd52 Add dynamic shapes aot_eager to periodic (#92770)
This means it overlaps with ciflow/inductor, but I'm about
to change that soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92770
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/desertfire
2023-01-25 02:21:02 +00:00