Commit Graph

650 Commits

Author SHA1 Message Date
Animesh Jain
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
Jason Ansel
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
James Wu
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
Yifu Wang
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
angelayi
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it into the CI but the test is flaky on hud. It alternates between fail_accuracy and fail_to_run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
angelayi
ae4c85960f Add Deberta pass (#121206)
Adding DebertaForQuestionAnswering to inductor benchmark pass, as it did not show up before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121206
Approved by: https://github.com/desertfire
2024-03-05 17:56:25 +00:00
Sun, Jiayi
ee557d8f61 skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697)
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` fails under dynamic shape testing. This PR skips the dynamic batch size testing for this model.

* Error message:
```
  File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

* Root cause:
The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equals the batch size, the error above is raised.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:

```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124.,  82.,  ...,   3.,   4.,   5.],
         [125., 104.,  65.,  ...,   3.,   3.,   4.],
         [ 87.,  68.,  34.,  ...,   2.,   2.,   2.],
         ...,
         [ 47.,  45.,  41.,  ...,  45.,  45.,  45.],
         [ 46.,  44.,  40.,  ...,  44.,  45.,  46.],
         [ 46.,  44.,  40.,  ...,  43.,  45.,  46.]],

        [[154., 129.,  84.,  ...,   3.,   4.,   5.],
         [133., 110.,  69.,  ...,   3.,   3.,   4.],
         [ 95.,  76.,  43.,  ...,   2.,   2.,   2.],
         ...,
         [ 44.,  42.,  38.,  ...,  34.,  37.,  39.],
         [ 43.,  41.,  37.,  ...,  35.,  39.,  41.],
         [ 43.,  41.,  37.,  ...,  35.,  40.,  43.]],

        [[171., 140.,  85.,  ...,   3.,   4.,   5.],
         [147., 120.,  71.,  ...,   3.,   3.,   4.],
         [103.,  83.,  47.,  ...,   2.,   2.,   2.],
         ...,
         [ 46.,  44.,  40.,  ...,  16.,  20.,  22.],
         [ 45.,  43.,  39.,  ...,  17.,  22.,  26.],
         [ 45.,  43.,  39.,  ...,  18.,  24.,  28.]]])}, ... ],)
```

None of the input dims equals the batch size, so I think we need to skip the dynamic batch size testing for this model.
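
The failure mode can be illustrated with a small sketch. It assumes inputs are described only by their shapes (plain tuples), and `mark_batch_dims` is a hypothetical stand-in for the benchmark's marking step, not the code in `common.py`. The `(3, 427, 640)` shape is read off the printed input above (three channels, height 427, width 640); the `(4, 3, 224, 224)` shape is made up for contrast.

```python
def mark_batch_dims(shapes, batch_size):
    """Return (input_index, dim_index) pairs whose size equals batch_size.

    Hypothetical stand-in for the benchmark's marking step; shapes are
    plain tuples rather than real tensors.
    """
    marked = [
        (i, d)
        for i, shape in enumerate(shapes)
        for d, size in enumerate(shape)
        if size == batch_size
    ]
    # Same assertion message as in the traceback above.
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
    return marked


# A conventional batched input: dim 0 matches the batch size.
print(mark_batch_dims([(4, 3, 224, 224)], batch_size=4))

# An image-shaped input like the one above: no dim equals 4.
try:
    mark_batch_dims([(3, 427, 640)], batch_size=4)
except AssertionError as err:
    print("AssertionError:", err)
```

Under these assumptions the second call reproduces the assertion, which is why skipping the model is simpler than trying to find a dim to mark.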

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-03-05 12:12:18 +00:00
angelayi
c3c618c750 Update torchbench pin (#121029)
Fixes https://github.com/pytorch/pytorch/issues/117280 after bumping the HF version in https://github.com/pytorch/benchmark/pull/2179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121029
Approved by: https://github.com/desertfire
2024-03-05 03:21:32 +00:00
PyTorch MergeBot
368f242e37 Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)"
This reverts commit 8c2e569928.

Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))
2024-03-03 02:58:47 +00:00
Shunting Zhang
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of the conv-batchnorm fusion, which causes relatively large numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.

For the failed TB models, the numerical difference is too large. After discussing with @eellison, we decided to skip them with freezing for now.

On the other hand, we should probably dig further into why the conv-bn fusion causes such a large numerical difference.
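
For readers unfamiliar with the fusion, here is a minimal scalar sketch of folding batchnorm statistics into a preceding convolution. It is an illustration under simplifying assumptions (a scalar "conv" `y = w * x + b`, made-up parameter values, hypothetical helper names), not Inductor's implementation. The two forms are algebraically identical, so any difference comes from the changed order of floating-point operations.

```python
import math


def fold_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batchnorm parameters into a scalar conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference path: conv first, then batchnorm in inference mode."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Made-up values, purely for illustration.
x, w, b = 1.5, 0.3, -0.2
gamma, beta, mean, var = 1.1, 0.05, 0.4, 2.0

unfused = conv_then_bn(x, w, b, gamma, beta, mean, var)
w_folded, b_folded = fold_conv_bn(w, b, gamma, beta, mean, var)
fused = w_folded * x + b_folded

# Compare with the relative tolerance mentioned above for the TIMM models.
print(math.isclose(fused, unfused, rel_tol=8e-2))
```

In float64 the two results agree far more tightly than `8e-2`; the large churn reported here shows up with real models and lower-precision arithmetic, which this sketch does not attempt to reproduce.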

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
Chien-Chin Huang
8c2e569928 [PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)
With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model.

Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-03-01 08:35:22 +00:00
leslie-fang-intel
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` fail under dynamic shape testing. This PR skips the dynamic batch size testing for these 3 models.

* Error message:
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* Root cause:
  * The benchmark code only annotates an input dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equals the batch size, the error above is raised.
  * However, for these 3 models, none of the input dims equals the batch size because of the [relationship between the dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Additionally, `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` pass the dynamic batch size accuracy testing, because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Given this relationship between the input dim sizes, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`. Per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 (@avikchaudhuri), it looks like this case is not expressible. So I think we need to skip the dynamic batch size testing for these 3 models.
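
The arithmetic behind the two observations can be checked directly. The sketch below reuses the shape formula quoted above; `pyhpc_shape` is a hypothetical wrapper name, and only the two batch sizes mentioned in this message are tried.

```python
import math


def pyhpc_shape(size):
    """Shape formula quoted above, wrapped in a helper for experimentation."""
    return (
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(2 * size ** (1 / 3)),
        math.ceil(0.25 * size ** (1 / 3)),
    )


for size in (4, 1048576):
    shape = pyhpc_shape(size)
    # The last column says whether any dim equals the batch size.
    print(size, shape, size in shape)
```

For `size = 4` the shape is `(4, 4, 1)`, so a dim matching the batch size exists and the accuracy run passes. For `size = 1048576` the shape is `(204, 204, 26)`, so no dim matches and the assertion fires.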

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
Sergii Dymchenko
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#12018) (#120677)
This reverts commit 298c686d3f.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
Shunting Zhang
b381a4372b make GPT2ForSequenceClassification pass inference accuracy check (#120537)
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32 it will pass the accuracy check.

Adding --freezing also makes the test pass for this model. I think that may be because different fusion output is generated (depending on whether constant propagation happens, which freezing controls), causing small numerical differences.
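
As background on why the dtype matters, bfloat16 keeps only 7 explicit mantissa bits, compared with 10 for float16 and 23 for float32, so small differences in operation order are more visible. The sketch below emulates bfloat16 rounding with the standard library; `to_bfloat16` is a hypothetical helper and does not handle NaN or overflow. It illustrates the precision gap only and is not a diagnosis of this model.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Works on the float32 bit pattern and keeps the top 16 bits.
    NaN and overflow handling are deliberately omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that truncating the low 16 bits rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 1.001, 1.0078125):
    print(value, "->", to_bfloat16(value))
```

Near 1.0 the spacing between adjacent bfloat16 values is `2 ** -7`, so `1.001` collapses to `1.0` while `1.0078125` is representable exactly.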

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
2024-02-26 11:02:57 +00:00
leslie-fang-intel
c617e7b407 Add resnet50/mobilenet_v2_quantized_qat in into deterministic_algorithms exclusive list (#120384)
After PR https://github.com/pytorch/pytorch/pull/120026, 2 `Torchbench` test cases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, pass the performance test but fail the accuracy test. The failure message is: `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'.`

- `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. fff9d98e58/benchmarks/dynamo/common.py (L3480)
- However, `quantized_resize_cpu_` only supports `nondeterministic_algorithms` because the resized output memory may be uninitialized. fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)

This PR adds these 2 models to the deterministic_algorithms exclusion list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384
Approved by: https://github.com/desertfire, https://github.com/jgong5
2024-02-26 05:05:43 +00:00
Yifu Wang
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
Yukio Siraichi
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
xiangdong
e06978be4b [CI] Add initial inductor cpu smoketest for performance (#116456)
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116456
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-02-21 20:04:50 +00:00
Yukio Siraichi
92bf2a4550 [torchbench] Update skipped models. (#120117)
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:

- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-02-19 18:08:32 +00:00
Chien-Chin Huang
c0e5cca4f8 [DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437)
Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437
Approved by: https://github.com/wconstab, https://github.com/xmfan
2024-02-13 16:53:56 +00:00
chuanqiw
074f2bb5ce Fix dynamo benchmark runner for torchbench skip sets (#118615)
Fix the dynamo benchmark runner for torchbench skip sets, which were introduced by PR #118032.

This runner.py script is still used in the [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) regular test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615
Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang
2024-02-06 02:06:54 +00:00
PyTorch MergeBot
966db82c9d Revert "Remove extra graph breaks (#118987)"
This reverts commit 9a8e3b07d7.

Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))
2024-02-05 22:19:37 +00:00
Michael Lazos
9a8e3b07d7 Remove extra graph breaks (#118987)
Fixes https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987
Approved by: https://github.com/janeyx99
2024-02-03 05:55:09 +00:00
BowenBao
30f43e3d89 [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
2024-01-31 23:03:39 +00:00
Yukio Siraichi
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
Jason Ansel
c5702a0891 [dynamo] Optimize BACKEND_MATCH guard (#118065)
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
Simon Fan
ed0ec2e0be Remove dynamo runner's dependency on distributed build (#117903)
This lets us bisect faster without needing to rebuild the distributed module. We remove the annotation to avoid the flake8 undefined-name lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903
Approved by: https://github.com/xuzhao9
2024-01-24 06:51:14 +00:00
Jane Xu
13d2cdffa2 Remove optimizer.step patching for profiler hook (#115772)
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR.

```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```

2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)

<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).

There is no Python refcycle, as the backrefs for `p_ref()` look like:
![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8)
(so 5 backrefs but none of them python)

And the refs:
![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-23 20:15:41 +00:00
Bin Bao
4d625c1c92 [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
Summary:
tree_flatten_spec should use args instead of *args

clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes

Test Plan: CI

Differential Revision: D52982401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039
Approved by: https://github.com/angelayi
2024-01-23 14:54:02 +00:00
Michael Lazos
f302a0d380 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-19 04:28:50 +00:00
Jason Ansel
a669319450 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
Animesh Jain
6e4e81a9ef [dynamo] Extend LazyVariableTracker to tuples (#117426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-18 15:51:28 +00:00
Bin Bao
26956980c6 [AOTI] Add torch._export.aot_load (#117610)
Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable.

Test Plan: CI

Differential Revision: D52825456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2024-01-18 15:02:16 +00:00
PyTorch MergeBot
b0084be114 Revert "Re-enable SGD (#117434)"
This reverts commit e7fac72be7.

Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))
2024-01-18 11:37:36 +00:00
Michael Lazos
e7fac72be7 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-18 06:47:15 +00:00
Nikita Shulga
a1afd1b195 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
It should never have been landed, but was landed again thanks to
ghstack grafting/ungrafting; see the discussion on https://github.com/pytorch/pytorch/pull/116910

This reverts commit e457b6fb18.
2024-01-17 17:06:32 -08:00
titaiwangms
e457b6fb18 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 23:03:15 +00:00
PyTorch MergeBot
da6abaeeac Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit bb0fd1bd3c.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
titaiwangms
bb0fd1bd3c [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 19:12:24 +00:00
PyTorch MergeBot
9da01affd3 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit 3a52147cc5.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
Jason Ansel
3a52147cc5 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-16 22:30:04 +00:00
Simon Fan
4b25948ee6 Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332)
- Removes an outdated assert that prevents perf tests from running DDP; we now have single node --multiprocess, and perf tests already wrap the model using `deepcopy_and_maybe_ddp`
- Append rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2024-01-12 22:41:09 +00:00
Simon Fan
88bf84f106 [benchmark] add --compile-autograd to dynamo benchmarks (#117196)
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats

e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```

e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```
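
A report in this format can be inspected with the standard library. The sketch below copies the rows from the `accuracy_inductor.csv` sample above (without the elided `...` line) and lists the models where compiled autograd captured at least one graph.

```python
import csv
import io

# Rows copied from the accuracy_inductor.csv sample above.
SAMPLE = """\
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Models with at least one compiled-autograd capture.
captured = [row["name"] for row in rows if int(row["autograd_captures"]) > 0]
print(captured)
```

With the sample rows this prints `['BERT_pytorch', 'LearningToPaint']`; the skipped and failed models report zero captures.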

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
2024-01-11 20:12:58 +00:00
Bin Bao
7e9cbc6834 [CI] Catch more exception types when running eager in PT2 tests (#117120)
Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with a KeyError but the error is not logged in the report csv file, which can cause an eager model failure to be silently ignored in the PT2 integration test.
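
The idea can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: `load_and_report`, `broken_loader`, and the two-column row layout are hypothetical, while the `eager_fail_to_run` status string is the one that appears in the benchmark reports.

```python
import csv
import io


def load_and_report(name, load_model, writer):
    """Load a model; on any failure, record a row so the error is not lost.

    `load_model` and the two-column row layout are hypothetical.
    """
    try:
        return load_model(name)
    except Exception as err:  # broad on purpose: KeyError, RuntimeError, ...
        writer.writerow([name, "eager_fail_to_run"])
        print(f"{name}: {type(err).__name__}: {err}")
        return None


def broken_loader(name):
    # Simulates the KeyError seen during model loading.
    raise KeyError("missing_config_entry")


report = io.StringIO()
load_and_report("some_model", broken_loader, csv.writer(report))
print(report.getvalue().strip())
```

Catching `Exception` rather than a short list of types means a new kind of loading error still produces a report row instead of disappearing.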

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120
Approved by: https://github.com/huydhn
2024-01-11 17:46:11 +00:00
Huy Do
3b2ddb6f71 Update TorchBench pinned commit (#117073)
~~To match their recent v4.36.2 release https://github.com/huggingface/transformers/commits/v4.36.2.  This is to fix the KeyError showing on release branch https://github.com/pytorch/pytorch/actions/runs/7451512288/job/20279117324#step:16:1336.  I think this can be updated in main too because the current pinned commit is already 4-month old.~~

Check with @desertfire, trying to update TorchBench pinned commit instead.

The test is also failing in main https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1120, but for some reason, it doesn't surface as a failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117073
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi, https://github.com/desertfire
2024-01-11 08:35:00 +00:00
Bin Bao
b8374314cc [AOTI] Update AOTI runner util (#116971)
Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971
Approved by: https://github.com/chenyang78
2024-01-09 19:07:54 +00:00
Huy Do
3c7f358c91 Update the expected accuracy value for demucs (#116944)
Update the expected value with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py b847290ddd9c6a5a598c70f8b660ee2b1e71dc95` as this is now failing in trunk after 95041829c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116944
Approved by: https://github.com/voznesenskym
2024-01-07 13:34:51 +00:00
Bin Bao
640d46f823 [inductor] Control the cpp_wrapper mode with an env variable (#116615)
Summary: also add one model test for the cpp_wrapper mode on CI
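
A minimal sketch of the pattern, reading a boolean switch from the environment. The helper `env_flag` is hypothetical, and the variable name `TORCHINDUCTOR_CPP_WRAPPER` is an assumption, since the commit message does not spell it out.

```python
import os


def env_flag(name, default="0"):
    """Return True when the environment variable `name` is set to "1"."""
    return os.environ.get(name, default) == "1"


# The variable name is an assumption; the commit message does not state it.
os.environ["TORCHINDUCTOR_CPP_WRAPPER"] = "1"
cpp_wrapper = env_flag("TORCHINDUCTOR_CPP_WRAPPER")
print(cpp_wrapper)
```

An environment switch like this lets CI enable the mode for one job without changing benchmark arguments.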

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116615
Approved by: https://github.com/angelayi
2024-01-02 21:50:25 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could. For a few, I could not figure out what the proper fix was, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
Isuru Fernando
a254fbfd61 Initialize variable for all codepaths in dynamo benchmarks (#116260)
Sometimes the first statement that sets this variable in the try block fails with an out-of-memory error. The finally block then tries to delete the variable, even though it was never assigned in the first place.
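
The pattern can be reproduced in a few lines. The function and variable names below are hypothetical and the out-of-memory condition is simulated; the point is only that binding the name before the `try` makes the cleanup valid on every path.

```python
def run_unsafe(load):
    try:
        model = load()
    finally:
        # Raises UnboundLocalError if load() failed before assignment.
        del model


def run_safe(load):
    model = None  # bound on every code path
    try:
        model = load()
        return "ok"
    except MemoryError:
        return "out of memory"
    finally:
        del model  # always valid now


def oom_loader():
    # Stand-in for a model load that runs out of memory.
    raise MemoryError("simulated OOM")


print(run_safe(oom_loader))
try:
    run_unsafe(oom_loader)
except UnboundLocalError as err:
    print("UnboundLocalError:", err)
```

In the unsafe version the `UnboundLocalError` from the cleanup replaces the original `MemoryError`, which hides the real cause of the failure.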

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
2023-12-26 05:15:39 +00:00