Commit Graph

351 Commits

Author SHA1 Message Date
Arash Pakbin
f3ddc08ddc Additional operators in operator benchmark (#145625)
The list of added operators:
add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where

This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625
Approved by: https://github.com/jeffdaily
2025-01-26 19:20:02 +00:00
Aaron Orenstein
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix, we first need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally, running `lintrunner -a --take RUFF` fixes up the deprecated uses.
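
As a rough illustration of the kind of rewrite UP006 performs (the function below is a made-up example, not code from the repository), annotations that use `typing.Dict` and `typing.List` are replaced with the built-in generics that PEP 585 made available in Python 3.9:

```python
from typing import Dict, List


# Before: annotations written with the typing aliases.
def group_by_length_old(words: List[str]) -> Dict[int, List[str]]:
    groups: Dict[int, List[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups


# After: the same function using built-in generics (requires Python 3.9+).
def group_by_length(words: list[str]) -> dict[int, list[str]]:
    groups: dict[int, list[str]] = {}
    for word in words:
        groups.setdefault(len(word), []).append(word)
    return groups
```

Only the annotations change; the runtime behaviour of the two functions is identical.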

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
Arash Pakbin
a37db5ae39 operator benchmark change parsing from regex based to manual (#144297)
The regex-based parser would erroneously split on commas inside nested brackets. For example, it would produce the following incorrect parse:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2']

The new manual parser handles this situation the right way:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
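
A minimal sketch of a bracket-aware split, assuming the goal is simply to split on commas at nesting depth zero. The function name `split_top_level` is hypothetical and this is not necessarily how the PR implements its parser:

```python
def split_top_level(text: str) -> list[str]:
    """Split on commas that are not nested inside (), [] or {}."""
    parts = []
    current = []
    depth = 0  # current bracket nesting depth
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current field.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current).strip())
    return parts


print(split_top_level("M: [(32, 16), (64, 32)], ZPB: 2"))
# ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
```

Tracking depth with a counter keeps the commas inside `[(32, 16), (64, 32)]` attached to their field, which a flat comma split cannot do.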

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297
Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily
2025-01-10 19:15:36 +00:00
Arash Pakbin
86c3370bc3 operator benchmark: write output to a JSON (#142809)
This pull request adds the option of writing the operator benchmark output to a user-specified JSON file. The output is still printed in the terminal as before; saving it to a JSON file is optional.

The main part of the functionality is implemented in the function `_perf_result_to_dict`, which returns a dictionary to be put inside the JSON file. Each dictionary corresponds to a single test.
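
The one-dictionary-per-test layout could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper names, the field names and the sample values are hypothetical, and the real `_perf_result_to_dict` may take different arguments and emit different keys.

```python
import json


def perf_result_to_dict_sketch(test_name: str, mode: str, latency_us: float) -> dict:
    # Hypothetical fields; the actual function may record different ones.
    return {"test_name": test_name, "mode": mode, "latency_us": latency_us}


def results_to_json(results: list[dict]) -> str:
    # One dictionary per test, serialized together as a JSON list.
    return json.dumps(results, indent=2)


def write_results(results: list[dict], output_path: str) -> None:
    # Optional step: persist the same JSON text to a file chosen by the user.
    with open(output_path, "w") as f:
        f.write(results_to_json(results))


# Sample values are invented purely to show the shape of the output.
results = [perf_result_to_dict_sketch("add_M8_N16_cpu", "Eager", 12.5)]
print(results_to_json(results))
```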

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
2024-12-14 01:42:00 +00:00
Tom Ritchford
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
Xuehai Pan
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
Pavel Belevich
a3e1416c05 Fix out_tensor device in diag_test.py (#134020)
This benchmark fails if device='cuda' but out_tensor is on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134020
Approved by: https://github.com/soulitzer
2024-08-21 20:43:39 +00:00
laithsakka
7673ee5456 remove benchmarks/__init__.py (#133390)
trying to address https://github.com/pytorch/pytorch/issues/133377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133390
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ezyang
2024-08-15 19:08:10 +00:00
laithsakka
f5e704a6f2 Add instruction count benchmark to run on pull requests (#131475)
This PR only adds running the benchmarks on the current PR and printing the results; follow-up diffs will add checking out head~1, running the benchmarks there, and comparing.

To access the results, go to the test pr_time_benchmarks and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
2024-08-12 05:20:26 +00:00
Xu Zhao
4eee2e7a6d [operator_benchmark] Remove TARGETS from broken benchmarks (#131460)
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4

Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.

Test Plan: Sandcastle CI

Differential Revision: D60086216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
2024-07-23 20:06:08 +00:00
Xuehai Pan
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
Xuehai Pan
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
diwei sun
62311257ad Add 1 test case for Convtranspose1D in op microbenchmark (#127216)
The ConvTranspose1d operator suffers a performance regression with a specific shape (#120982). This PR adds that shape to the op-level benchmark.

I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary:

Hardware info: Intel SPR8480-56cores per socket with frequency=2.1G.
Performance comparison between torch 1.13 vs. torch 2.2
Benchmarking **PyTorch1.13**: ConvTranspose1d Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **0.96s**

Benchmarking **PyTorch2.2**: ConvTranspose1d Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **7.988s**

Also benchmarking for 7 rounds to check the variance.

| Version | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Normalized Variance |
|---------|--------|--------|--------|--------|--------|--------|--------|---------------------|
| PyTorch 1.13 | 0.971 | 0.972 | 0.969 | 0.970 | 0.972 | 0.970 | 0.971 | 0.0002% |
| PyTorch 2.2 | 8.064 | 8.053 | 8.027 | 7.927 | 7.971 | 7.929 | 7.902 | 0.0059% |
| Ratio v2.2 vs. v1.13 (lower is better) | 8.31 | 8.28 | 8.29 | 8.18 | 8.20 | 8.18 | 8.14 | |

Reproduce script:
numactl -N 0 python -m pt.conv_test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-06-12 05:33:54 +00:00
cyy
2fd75667b4 [Caffe2]Remove Caffe2 scripts and benchmarks (#126747)
Due to removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-05 23:46:31 +00:00
Xuehai Pan
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
PyTorch MergeBot
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af6.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
Xuehai Pan
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
Xuehai Pan
0dae2ba5bd [2/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort caffe2 (#127123)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122
2024-05-25 18:26:34 +00:00
Aaron Gokaslan
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods (`update`, `difference_update`, etc.) instead of manually reimplementing them.
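
A small sketch of the pattern this rule targets. The variable names and values are made up for illustration and are not taken from the PR:

```python
names = {"add", "mul"}
extra = ["div", "sub", "add"]

# Before: a manual loop that re-implements set.update().
manual = set(names)
for name in extra:
    manual.add(name)

# After: the set mutation method does the same in one call.
updated = set(names)
updated.update(extra)

# The same idea applies to removal: a loop of discard() calls
# becomes a single difference_update().
remaining = set(updated)
remaining.difference_update(["div", "sub"])

print(sorted(updated))    # ['add', 'div', 'mul', 'sub']
print(sorted(remaining))  # ['add', 'mul']
```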

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
baocheny
edd03f975f highlight readme code block (#120228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120228
Approved by: https://github.com/mikaylagawarecki
2024-02-22 21:23:08 +00:00
sanchitintel
8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ the AVX512 implementation) in `_vec_log_softmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll back that change.

In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for softmax kernels other than `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already computed by a heuristic. If `grain_size` is `0`, work is distributed equitably among the OpenMP threads (whose number, BTW, stays constant unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a speedup similar to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|
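
The two derived columns follow directly from the formulas in the table headers. A short worked example, using the latencies from the first row of the first table (the helper name `speedup` is hypothetical):

```python
def speedup(old_ms: float, new_ms: float) -> tuple[float, float]:
    """Return (speedup percentage, speedup ratio) for a pair of latencies."""
    percentage = (old_ms - new_ms) * 100 / old_ms  # (old-new)*100/old
    ratio = old_ms / new_ms                        # old/new
    return round(percentage, 2), round(ratio, 3)


# Softmax_N1_C3_H256_W256_cpu: 31.364 ms before, 11.594 ms after.
print(speedup(31.364, 11.594))  # (63.03, 2.705)
```

Other rows may differ in the last digit if the tables were computed from unrounded latencies.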

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
Aaron Gokaslan
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could. For a few, I could not figure out what the proper fix was, so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
baocheny
e01e00fba8 fix code spell (#116530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116530
Approved by: https://github.com/albanD
2023-12-29 12:58:38 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
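
For context, a small sketch of the pattern RUF017 looks for, with made-up data. Using `itertools.chain.from_iterable` is just one linear-time alternative and is not necessarily the replacement ruff itself suggests:

```python
import itertools

chunks = [[1, 2], [3], [4, 5, 6]]

# Flagged pattern: every "+" inside sum() copies the accumulated list,
# so the total work grows quadratically with the number of elements.
flat_quadratic = sum(chunks, [])

# Linear-time alternative: chain the sublists lazily, then build one list.
flat_linear = list(itertools.chain.from_iterable(chunks))

print(flat_linear)  # [1, 2, 3, 4, 5, 6]
```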

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
FFFrog
9a1cdcb8a0 Format: fixing multiple string concatenation in single line (#106013)
Fixing multiple string concatenation in single line
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013
Approved by: https://github.com/albanD
2023-07-26 18:39:18 +00:00
Edward Z. Yang
dd3a77bc96 Apply UFMT to all files in benchmarks/ (#105928)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928
Approved by: https://github.com/albanD
2023-07-26 01:18:48 +00:00
Justin Chu
5ef023b05a [BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429
Approved by: https://github.com/malfet
2023-07-19 04:46:37 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.
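
A minimal sketch of why the generator form matters for `any`/`all`. The helper `is_even` and the input values are made up for illustration:

```python
calls = []


def is_even(value: int) -> bool:
    calls.append(value)  # record every element that gets inspected
    return value % 2 == 0


# Flagged by C419: the list comprehension evaluates every element first.
any([is_even(v) for v in [3, 8, 15]])
calls_with_list = list(calls)

calls.clear()

# Generator expression: any() stops at the first truthy result.
any(is_even(v) for v in [3, 8, 15])
calls_with_generator = list(calls)

print(calls_with_list)       # [3, 8, 15]
print(calls_with_generator)  # [3, 8]
```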

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Xuehai Pan
8d45f555d7 [BE] [1/3] Rewrite super() calls in caffe2 and benchmarks (#94587)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)
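
For the common case, the rewrite is purely syntactic. A small sketch with hypothetical classes (not taken from the codebase):

```python
class Base:
    def __init__(self):
        self.ready = True


class OldStyle(Base):
    def __init__(self):
        # Python 2 compatible spelling: class and instance passed explicitly.
        super(OldStyle, self).__init__()
        self.name = "old"


class NewStyle(Base):
    def __init__(self):
        # Zero-argument form; equivalent here inside a regular method.
        super().__init__()
        self.name = "new"


print(OldStyle().ready, NewStyle().ready)  # True True
```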

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587
Approved by: https://github.com/ezyang
2023-02-11 18:19:48 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift or caps-lock key, whereas `-` does not.
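
One way to keep both spellings working is to register them as aliases of the same argument. This is a sketch with a hypothetical option name, not the exact code from the PR:

```python
import argparse

parser = argparse.ArgumentParser()
# The dashed form is listed first; the underscore form stays as an alias
# for backward compatibility. "--warmup-iterations" is a made-up option.
parser.add_argument(
    "--warmup-iterations",
    "--warmup_iterations",
    type=int,
    default=0,
)

args_dash = parser.parse_args(["--warmup-iterations", "5"])
args_underscore = parser.parse_args(["--warmup_iterations", "7"])

# argparse derives the attribute name from the first long option,
# converting dashes to underscores.
print(args_dash.warmup_iterations)        # 5
print(args_underscore.warmup_iterations)  # 7
```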

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
sanchitintel
c4544bc169 Fix thread-allocation in _vec_log_softmax_lastdim (#85398)
## Problem history

There seems to have always been a bug in `_vec_log_softmax_lastdim`.
In particular, there were two issues with it:

#### Bug 1
 Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`:
 `CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();`

It was  `256` for float32, bfloat16, and float16.
When AVX512 support was added, `CHUNK_SIZE` became `512`.

The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it.

#### Bug 2
`grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)`
So `grain_size` was usually 0, since it evaluated to `8 / dim_size`, and it was therefore always replaced by `CHUNK_SIZE`, viz. 256.
Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases.

#### Problem caused by bugs
With an `outer_size` of, say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`!
When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_size` was 700.
In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`.
AVX512 thus appeared to be considerably slower, masking the actual issue: even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads.
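
The thread counts above can be reproduced with a simplified model. The sketch makes three assumptions that the text does not state outright: `internal::GRAIN_SIZE` is 32768 (consistent with the `8 / dim_size` figure above), the computed grain size falls back to `CHUNK_SIZE` whenever it is smaller, and the number of threads used is at most `ceil(outer_size / grain_size)`. The function names are hypothetical.

```python
import math

GRAIN_SIZE = 32768  # assumed value of internal::GRAIN_SIZE


def effective_grain_size(dim_size: int, chunk_size: int) -> int:
    # Integer division mirrors the expression described under "Bug 2".
    grain_size = GRAIN_SIZE // (16 * dim_size * chunk_size)
    # Assumed fallback: never go below the chunk size.
    return max(grain_size, chunk_size)


def max_threads(outer_size: int, dim_size: int, chunk_size: int) -> int:
    # Simplified model: one thread per grain-sized slice of the outer dim.
    return math.ceil(outer_size / effective_grain_size(dim_size, chunk_size))


# Tensor of shape (700, 23258) from the Transformers example.
print(max_threads(700, 23258, chunk_size=256))  # AVX2: 3
print(max_threads(700, 23258, chunk_size=512))  # AVX512: 2
```

Under this model the thread count depends only on `outer_size` and `CHUNK_SIZE`, which matches the observation that `dim_size` had no influence.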

## Solution
Distribute work more efficiently, which results in higher performance for both AVX2 & AVX512 than before
and fixes the regression observed with AVX512 (the AVX512 kernel is now faster than its AVX2 counterpart).

## Benchmarks

##### Machine-config:
Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake)
One socket of 26 physical cores was used.
Intel OpenMP & tcmalloc were preloaded.

Example of a command to run benchmark:
`ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu`

| Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new) |
|-----------|------------------------------|------------------------------|-------------------------|
| LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x |
| LogSoftmax_N1024_seq_len23258_dim1_cpu AVX512 | 18292.928 | 2586.550 | 7.07x |
| LogSoftmax_N700_seq_len23258_dim1_cpu AVX2 | 9611.902 | 1762.833 | 5.452x |
| LogSoftmax_N700_seq_len23258_dim1_cpu AVX512 | 12168.371 | 1717.824 | 7.08x |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano
2023-02-07 15:09:05 +00:00
salilsdesai
323e0143d6 [Op Benchmark] Add Pointwise Conv2d Op Benchmark (#91918)
@bypass-github-export-checks

Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```) with

The configs are copied from Conv2d with the kernel parameter removed.

I considered just using the same configs, ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then the op benchmark title would say kernel=3 even though that would not be the case.

Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918
Approved by: https://github.com/mcr229
2023-01-10 21:36:37 +00:00
Sergii Dymchenko
30edd39bdc Fix non-existing parameters in docstrings in benchmarks (#91115)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115
Approved by: https://github.com/clee2000
2022-12-20 02:07:32 +00:00
Kazuaki Ishizaki
14d5f139d2 Fix typos under benchmarks, test, and tools directories (#87975)
This PR fixes typos in `.md` files under benchmarks, test, and tools directories
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975
Approved by: https://github.com/kit1980
2022-10-29 01:26:17 +00:00
Nicolas Hug
97de281176 Improve interpolate() speed for channels_last CPU images and masks (#86361)
This PR improves the speed of `interpolate()`:
- on CPU
- on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling

The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like the number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran many more benchmarks than this one, see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```

An immediate follow-up to this PR would be to do the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>

Code:

<details>

I used this file, which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Cannot use channels_last with an input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.uint8 and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parentheses
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```
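
The filtering step at the end of `make_config()` can be pictured without `operator_benchmark`. This is a minimal sketch using plain tuples; the dtype strings are stand-ins chosen for illustration, not what the benchmark framework actually stores:

```python
from itertools import product

modes = ["linear", "nearest", "nearest-exact"]
dtypes = ["float32", "uint8"]  # stand-ins for torch.float and torch.uint8

# Full cross product, minus the (linear, uint8) combination that the
# benchmark file filters out.
combos = [(m, d) for m, d in product(modes, dtypes)
          if not (m == "linear" and d == "uint8")]

for mode, dtype in combos:
    print(f"{mode:<15} {dtype}")
print(len(combos))  # 5
```

The five remaining combinations correspond to the five rows reported per input shape and thread count in the results.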

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged, we should be able to remove some hacks in vision as well: https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-11 16:17:36 +00:00
PyTorch MergeBot
3a2cfbb813 Revert "Improve interpolate() speed for channels_last images and masks (#86361)"
This reverts commit 93b2d99158.

Reverted https://github.com/pytorch/pytorch/pull/86361 on behalf of https://github.com/DanilBaibak due to Break the internal import process
2022-10-11 10:17:27 +00:00
Nicolas Hug
93b2d99158 Improve interpolate() speed for channels_last images and masks (#86361)
This PR improves the speed of `interpolate()`:
- on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling
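
For context, `channels_last` stores the channel values of each pixel next to each other in memory (NHWC order) while the tensor keeps its logical (N, C, H, W) shape. Below is a minimal sketch of the resulting strides in plain Python, assuming a contiguous 4-D tensor; the helper names `channels_last_strides` and `offset` are made up for illustration and are not PyTorch functions:

```python
def channels_last_strides(shape):
    """Element strides of a contiguous channels_last tensor with logical shape (N, C, H, W)."""
    n, c, h, w = shape
    # Memory order is N, H, W, C: stepping one channel moves 1 element,
    # one column moves C elements, one row moves W * C elements.
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat memory offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


strides = channels_last_strides((1, 3, 600, 400))
print(strides)                        # (720000, 1, 1200, 3)
print(offset((0, 2, 0, 1), strides))  # 5
```

With C = 1 or C = 3, each pixel occupies only one to three adjacent elements, which is the `num_channels < 4` case listed above.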

The actual speed-up ranges from 1.1X to 110X, depending on factors such as the number of threads and, of course, input_size/output_size.  In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran many more benchmarks than this one; see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```
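
Each row reads `<speed-up>  <time on main> vs <time with this PR>`. Below is a minimal sketch of how the speed-up column is derived, using the same formatting rule as the helper script that produced these tables (one decimal below 3X, none otherwise). The function name `speedup_label` is hypothetical, and it takes times in milliseconds for simplicity:

```python
def speedup_label(main_ms, new_ms):
    """Format the ratio of the time on main to the time with this PR, e.g. '14X'."""
    ratio = main_ms / new_ms
    fmt = ".1f" if ratio < 3 else ".0f"
    return f"{ratio:{fmt}}X"


print(speedup_label(0.852, 0.061))  # 14X
print(speedup_label(1.0, 1.0))      # 1.0X
```

A value below 1.0X means the new code was slower. The printed times are rounded, so recomputing a ratio from them can differ slightly from the speed-up shown in the table.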

An immediate follow-up to this PR would be to make the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>
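
As a reading aid for the table above, here is a minimal sketch that parses one result row and recomputes the speedup as the old (main) time divided by the new time. The `parse_row` helper and its field names are hypothetical and are not part of the PR.

```py
import re

# Matches one row of the results table, e.g.
# "(1, 1, 320, 320) -> (256, 256)  nearest  float32  num_threads=1  15X  1.109ms vs 0.073ms"
ROW = re.compile(
    r"\((?P<inp>[\d, ]+)\) -> \((?P<out>[\d, ]+)\)\s+"
    r"(?P<mode>\S+)\s+(?P<dtype>\S+)\s+num_threads=(?P<threads>\d+)\s+"
    r"(?P<speedup>[\d.]+)X\s+(?P<main>[\d.]+)ms vs (?P<new>[\d.]+)ms"
)


def parse_row(line):
    """Hypothetical helper: turn a table row into a dict, or None if it is not a row."""
    m = ROW.search(line)
    if m is None:
        return None
    main_ms = float(m["main"])
    new_ms = float(m["new"])
    return {
        "input_size": tuple(int(x) for x in m["inp"].split(",")),
        "output_size": tuple(int(x) for x in m["out"].split(",")),
        "mode": m["mode"],
        "dtype": m["dtype"],
        "num_threads": int(m["threads"]),
        "reported_speedup": float(m["speedup"]),
        "main_ms": main_ms,
        "new_ms": new_ms,
        # Speedup = time on main / time with this PR.
        "recomputed_speedup": main_ms / new_ms,
    }


row = parse_row(
    "(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms"
)
print(row["mode"], row["num_threads"], round(row["recomputed_speedup"], 1))
# nearest 1 15.2
```

Rows that print times with a single decimal (for example `0.3ms vs 0.2ms`) will not reproduce the reported ratio exactly, because the ratio was computed before the times were rounded for display.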

Code:

<details>

I used this file, which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.int and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```
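
To make the size of the benchmark grid concrete, the sketch below rebuilds the same cross product in plain Python. It assumes strings as stand-ins for the `torch` dtypes, and `build_grid` is a hypothetical helper rather than anything from `operator_benchmark`. It applies the same filter as `make_config`, dropping the `linear` + `uint8` combination.

```py
from itertools import product

SIZES = (
    ((224, 224), (64, 64)),
    ((224, 224), (128, 128)),
    ((600, 400), (224, 224)),
    ((320, 320), (256, 256)),
    ((800, 800), (500, 500)),
)
MODES = ("linear", "nearest", "nearest-exact")
DTYPES = ("float32", "uint8")  # stand-ins for torch.float and torch.uint8


def build_grid():
    """Hypothetical helper: enumerate (input_size, output_size, mode, dtype) cases."""
    shapes = []
    for hw1, hw2 in SIZES:
        for channels in (3, 1):
            shapes.append(((1, channels, *hw1), hw2))  # downsample
            shapes.append(((1, channels, *hw2), hw1))  # upsample
    grid = []
    for (input_size, output_size), mode, dtype in product(shapes, MODES, DTYPES):
        # Same filter as in make_config: skip linear interpolation on uint8.
        if mode == "linear" and dtype == "uint8":
            continue
        grid.append((input_size, output_size, mode, dtype))
    return grid


grid = build_grid()
print(len(grid))  # 100
```

That is 100 cases per `num_threads` value, or 10 rows per shape pair and thread count, which matches the blocks of 10 rows in the results.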

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged, we should be able to remove a hack in vision as well: https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-07 07:52:36 +00:00
Zafar
0e30da3f2f [refactor] Renaming ao.sparsity to ao.pruning (#84867)
`Sparsity` as a term doesn't reflect the tools developed by AO. `torch/ao/sparsity` also has utilities for structured pruning, which we have always referred to internally as just "pruning". To avoid confusion, we renamed `Sparsity` to `Prune` (`torch.ao.sparsity` becomes `torch.ao.pruning`). We will not provide backward compatibility, since this toolset has so far been under silent development.

This change will be reflected in the documentation as well.

**TODO:**
- [ ] Change the tutorials
- [ ] Confirm no bc-breakages
- [ ] Reflect the changes in the trackers and RFC docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867
Approved by: https://github.com/supriyar
2022-10-07 00:58:41 +00:00
zaf
2f04ba2c7c [quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716)
Context: To avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
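
For downstream code that still refers to the old locations, the checklist above can be read as a prefix mapping. The sketch below is a hypothetical illustration (`migrate_module_path` is not a PyTorch API) and includes only the entries checked off as of this PR.

```py
# Old -> new prefixes marked as migrated ([X]) in the checklist above.
MIGRATED_PREFIXES = {
    "torch.nn.quantized": "torch.ao.nn.quantized",
    "torch.nn.quantizable": "torch.ao.nn.quantizable",
    "torch.nn.qat": "torch.ao.nn.qat",
}


def migrate_module_path(path):
    """Hypothetical helper: return the torch.ao.nn path for a migrated module, else the input."""
    for old, new in MIGRATED_PREFIXES.items():
        # Match on module boundaries so "torch.nn.quantized" does not match "torch.nn.quantizable".
        if path == old or path.startswith(old + "."):
            return new + path[len(old):]
    return path


print(migrate_module_path("torch.nn.qat.modules"))        # torch.ao.nn.qat.modules
print(migrate_module_path("torch.nn.quantized.dynamic"))  # torch.ao.nn.quantized.dynamic
print(migrate_module_path("torch.nn.intrinsic.qat"))      # unchanged: not migrated yet
```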

The majority of the files are simply moved to the new location.
However, specific files need to be double-checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
zaf
d32a762147 [quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714)
Context: To avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

The majority of the files are simply moved to the new location.
However, specific files need to be double-checked:

- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
- [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo
- [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a
- [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a

Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:34 +00:00
zaf
c92e5ac95b [quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713)
Context: To avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
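
The renames in the checklist above all follow one prefix rule, which can be sketched as a small path rewriter. This is a minimal illustration only: `RENAMES` and `migrate_path` are hypothetical names that are not part of PyTorch, and the table covers the whole planned migration, not just the namespaces this PR moves.

```python
# Hypothetical helper (not part of PyTorch): rewrite a dotted module path
# using the old -> new prefixes from the checklist above.
RENAMES = {
    "torch.nn.quantized": "torch.ao.nn.quantized",
    "torch.nn.quantizable": "torch.ao.nn.quantizable",
    "torch.nn.qat": "torch.ao.nn.qat",
    "torch.nn.intrinsic": "torch.ao.nn.intrinsic",
}


def migrate_path(path: str) -> str:
    """Return the torch.ao.nn equivalent of `path`, or `path` unchanged."""
    for old, new in RENAMES.items():
        # Match whole dotted components only, so that a name such as
        # "torch.nn.quantizedfoo" is left untouched.
        if path == old or path.startswith(old + "."):
            return new + path[len(old):]
    return path
```

Matching on whole dotted components keeps `torch.nn.quantized` and `torch.nn.quantizable` from being confused with each other or with unrelated names that share a prefix.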

The majority of the files are simply moved to the new location.
However, the following files need to be double-checked:

- Documentation @vkuzo
  - docs/source/conf.py
  - docs/source/quantization.rst
- [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168
- [common test routine](test/quantization/ao_migration/common.py) @HDCharles
- JIT stuff @jamesr66a
  - torch/csrc/jit/passes/hoist_conv_packed_params.cpp
  - torch/csrc/jit/passes/quantization/helper.h
  - torch/csrc/jit/serialization/import_source.cpp

Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:33 +00:00
PyTorch MergeBot
6a9c02339d Revert "[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)"
This reverts commit 432f037498.

Reverted https://github.com/pytorch/pytorch/pull/78713 on behalf of https://github.com/janeyx99 due to breaking the (trunk-only) iOS build
2022-08-22 07:32:37 +00:00
PyTorch MergeBot
b1a7b67529 Revert "[quant][ao_migration] torch.nn.quantized.dynamic → torch.ao.nn.quantized.dynamic (#78714)"
This reverts commit e6fb97d8ae.

Reverted https://github.com/pytorch/pytorch/pull/78714 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:30:48 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: To avoid cluttering the `torch.nn` namespace,
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
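
Code that has to run both before and after a move like this one could try the new location first and fall back to the old one. The sketch below assumes nothing about PyTorch itself: `import_first_available` is a hypothetical helper built only on the standard library's `importlib.import_module`.

```python
import importlib
from types import ModuleType


def import_first_available(candidates: list[str]) -> ModuleType:
    """Import and return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(f"none of these modules could be imported: {candidates}")


# Intended use (requires PyTorch, so it is shown as a comment only):
# qat = import_first_available(["torch.ao.nn.qat", "torch.nn.qat"])
```

Listing the new path first means the fallback is only exercised on versions that predate the migration.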

The majority of the files are simply moved to the new location.
However, the following files need to be double-checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00