This PR only adds the execution of the benchmarks on this PR and print results, following diffs will add checking out head~1 and running it and comparing.
to access results goto test pr_time_benchmarks and inspect logs:
you should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
@bypass-github-export-checks
Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```) with
The configs are copied from Conv2d with the kernel parameter removed.
I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even if though that would not be the case.
Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918
Approved by: https://github.com/mcr229
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767
This diff implements the functionality of running benchmark on mobile on top of operator_benchmark framework. It does so through a few steps:
1. create a scripted module from existing benchmark case.
2. run mobile specific optimization pass on the scripted module
3. run the scripted module on AiBench by calling its Python API
A small change in the way of writing a benchmark case is introduced so that both local and mobile run can share the same interface. The change is about having inputs as arguments of the `forward` function, so that mobile optimization pass can be run successfully (otherwise everything will be optimized away by constant propagation).
Test Plan:
## local op_bench run
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1 --use_jit
Exceptions: `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version
```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:
Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
quant_min: int, quant_max: int
):
return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
axis: int, quant_min: int, quant_max: int
):
return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
~~~~~~~~~~~~ <--- HERE
```
`_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. These will be fixed in D24774105
## OSS test
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1
## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
parameters {
}
attributes {
training = True
num_iters = 1
benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
}
methods {
method forward {
graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
%12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
%4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
%1 : int = prim::GetAttr[name="num_iters"](%self)
= prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
block0(%i : int):
%6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
%9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
%23 : int = prim::Constant[value=1]()
%24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
-> (%4)
return (%12)
}
}
submodules {
module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
parameters {
}
attributes {
mobile_optimized = True
}
methods {
method forward {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
%input_one.1 : Tensor,
%input_two.1 : Tensor):
%3 : int = prim::Constant[value=1]()
%4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
return (%4)
}
method get_inputs {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
return (%self.inputs_tuple)
}
}
submodules {
}
}
}
}
```
Reviewed By: kimishpatel
Differential Revision: D24322214
fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
Current way of import configs will have runtime error when a single benchmark is launched directly with buck(e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). The diff fixed that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare across the two better.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29818
When some of the test running on cuda, there is a runtime error because of missing data transfer from cpu to cuda. This diff fixes that issue.
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K64_cpu
# Input: M: 64, N: 64, K: 64, device: cpu
Forward Execution Time (us) : 165.241
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K64_cuda
# Input: M: 64, N: 64, K: 64, device: cuda
Forward Execution Time (us) : 56.546
...
Reviewed By: hl475
Differential Revision: D18506269
fbshipit-source-id: 87942d7a52bd398600766c0f5363d791b74a6ca6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29616
1. Reduce the predefined_min_time which is the minimum time each test needs to run. Based on the test result, the average time across different epoch are pretty stable before exiting. So we can safely reduce the predefined time here.
2. Chang the input shapes of several ops
Test Plan:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
200 256.044864655
400 165.850520134
800 163.579881191
1600 162.871927023
3200 160.3128016
# Mode: Eager
# Name: add_cpu_M64_K64_bwd1_N64
# Input: device: cpu, K: 64, M: 64, N: 64
Backward Execution Time (us) : 164.715
# Benchmarking PyTorch: add
200 170.650482178
400 168.895125389
800 169.867575169
1600 163.400024176
3200 168.658420444
# Mode: Eager
# Name: add_cpu_M64_K64_bwd2_N64
# Input: device: cpu, K: 64, M: 64, N: 64
Backward Execution Time (us) : 168.777
Reviewed By: hl475
Differential Revision: D18438540
fbshipit-source-id: 1fd27cf4bbc34e46e74393af912ee2fcb75c33b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25477
We want to increase `in_c, out_c` so that the metric reported back are more stable
Test Plan:
```[huaminli@devvm2388.ftw3 ~/fbsource/fbcode] buck run mode/dev-nosan caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --operators None --iterations 3
```
runs fine on my devserver, last couple lines of output P107448746
Reviewed By: mingzhe09088
Differential Revision: D17133043
fbshipit-source-id: 0b989a530cbfe3d608471a30ae4bbda10e5216ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24136
This diff aims to reduce the execution of benchmark_all_test which runs all the supported operator benchmarks. In the default run, only one shape of each operator will be benchmarked. The rest of the benchmarks can be triggered with tag_filter flag.
Reviewed By: hl475
Differential Revision: D16736448
fbshipit-source-id: 33bd86f6fc2610f87f24240ad559fb11d3e35e89