Commit Graph

327 Commits

Author SHA1 Message Date
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
FFFrog
9a1cdcb8a0 Format: fixing multiple string concatenation in single line (#106013)
Fixing multiple string concatenation in single line
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013
Approved by: https://github.com/albanD
2023-07-26 18:39:18 +00:00
Edward Z. Yang
dd3a77bc96 Apply UFMT to all files in benchmarks/ (#105928)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928
Approved by: https://github.com/albanD
2023-07-26 01:18:48 +00:00
Justin Chu
5ef023b05a [BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429
Approved by: https://github.com/malfet
2023-07-19 04:46:37 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Aaron Gokaslan
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
Xuehai Pan
8d45f555d7 [BE] [1/3] Rewrite super() calls in caffe2 and benchmarks (#94587)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587
Approved by: https://github.com/ezyang
2023-02-11 18:19:48 +00:00
Xuehai Pan
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also typing `_` need to press the shift or the caps-lock key than `-`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
Aaron Gokaslan
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
sanchitintel
c4544bc169 Fix thread-allocation in _vec_log_softmax_lastdim (#85398)
## Problem history

There seems to always have been a bug in `_vec_log_softmax_lastdim `.
In particular, there were two issues with it -

#### Bug 1
 Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`:
 `CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();`

It was  `256` for float32, bfloat16, and float16.
When AVX512 support was added, `CHUNK_SIZE` became `512`.

The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it.

#### Bug 2
`grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)`
So, `grain_size` was usually 0, as it was `8 / (dim_size)`, so, it's always replaced by `CHUNK_SIZE`, viz. 256.
Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases.

#### Problem caused by bugs
With `outer_size` of say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`!
When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_dim` was 700.
In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`.
AVX512 thus appeared to be quite slower, cloaking the actual issue that even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads.

## Solution
Distribute work more efficiently, which would result in higher performance for both AVX2 & AVX512 than now,
and fixes the regression observed with AVX512 (AVX512 kernel would now be faster than its AVX2 counterpart).

## Benchmarks

##### Machine-config:
Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake)
One socket of 26 physical cores was used.
Intel OpenMP & tcmalloc were preloaded.

Example of a command to run benchmark:
`ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu`

Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new)
-- | -- | -- | --
LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x
LogSoftmax_N1024_seq_len23258_dim1_cpu  AVX512 | 18292.928 | 2586.550| 7.07x
LogSoftmax_N700_seq_len23258_dim1_cpu  AVX2 | 9611.902 | 1762.833 | 5.452x
LogSoftmax_N700_seq_len23258_dim1_cpu  AVX512 | 12168.371  | 1717.824 | 7.08x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano
2023-02-07 15:09:05 +00:00
salilsdesai
323e0143d6 [Op Benchmark] Add Pointwise Conv2d Op Benchmark (#91918)
@bypass-github-export-checks

Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```) with

The configs are copied from Conv2d with the kernel parameter removed.

I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even if though that would not be the case.

Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918
Approved by: https://github.com/mcr229
2023-01-10 21:36:37 +00:00
Sergii Dymchenko
30edd39bdc Fix non-existing parameters in docstrings in benchmarks (#91115)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115
Approved by: https://github.com/clee2000
2022-12-20 02:07:32 +00:00
Kazuaki Ishizaki
14d5f139d2 Fix typos under benchmarks, test, and tools directories (#87975)
This PR fixes typos in `.md` files under benchmarks, test, and tools directories
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975
Approved by: https://github.com/kit1980
2022-10-29 01:26:17 +00:00
Nicolas Hug
97de281176 Improve interpolate() speed for channels_last CPU images and masks (#86361)
This PR improves the speed of `interpolate()`:
- on CPU
-  on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling

The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size.  In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```

An immediate follow-up to this PR would be to do the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>

Code:

<details>

I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.int and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall("\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-11 16:17:36 +00:00
PyTorch MergeBot
3a2cfbb813 Revert "Improve interpolate() speed for channels_last images and masks (#86361)"
This reverts commit 93b2d99158.

Reverted https://github.com/pytorch/pytorch/pull/86361 on behalf of https://github.com/DanilBaibak due to Break the internal import process
2022-10-11 10:17:27 +00:00
Nicolas Hug
93b2d99158 Improve interpolate() speed for channels_last images and masks (#86361)
This PR improves the speed of `interpolate()`:
-  on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling

The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size.  In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```

An immediate follow-up to this PR would be to do the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>

Code:

<details>

I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.int and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall("\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-07 07:52:36 +00:00
Zafar
0e30da3f2f [refactor] Renaming ao.sparsity to ao.pruning (#84867)
`Sparsity` as a term doesn't reflect the tools that are developed by the AO. The `torch/ao/sparsity` also has utilities for structured pruning, which internally we always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing the backwards compatibility, as so far this toolset was kept under silent development.

This change will reflect the changes in the documentation as well.

**TODO:**
- [ ] Change the tutorials
- [ ] Confirm no bc-breakages
- [ ] Reflect the changes in the trackers and RFC docs

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867
Approved by: https://github.com/supriyar
2022-10-07 00:58:41 +00:00
zaf
2f04ba2c7c [quant][ao_migration] torch.nn.qattorch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:38 +00:00
zaf
d32a762147 [quant][ao_migration] torch.nn.quantized.dynamictorch.ao.nn.quantized.dynamic (#78714)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
- [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo
- [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a
- [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a

Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)!

Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:34 +00:00
zaf
c92e5ac95b [quant][ao_migration] torch.nn.quantized.modulestorch.ao.nn.quantized.modules (#78713)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- Documentation @vkuzo
  - docs/source/conf.py
  - docs/source/quantization.rst
- [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168
- [common test routine](test/quantization/ao_migration/common.py) @HDCharles
- JIT stuff @jamesr66a
  - torch/csrc/jit/passes/hoist_conv_packed_params.cpp
  - torch/csrc/jit/passes/quantization/helper.h
  - torch/csrc/jit/serialization/import_source.cpp

Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/)

Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713
Approved by: https://github.com/jerryzh168
2022-08-25 16:50:33 +00:00
PyTorch MergeBot
6a9c02339d Revert "[quant][ao_migration] torch.nn.quantized.modulestorch.ao.nn.quantized.modules (#78713)"
This reverts commit 432f037498.

Reverted https://github.com/pytorch/pytorch/pull/78713 on behalf of https://github.com/janeyx99 due to Reverting for breaking (trunk-only) ios build
2022-08-22 07:32:37 +00:00
PyTorch MergeBot
b1a7b67529 Revert "[quant][ao_migration] torch.nn.quantized.dynamictorch.ao.nn.quantized.dynamic (#78714)"
This reverts commit e6fb97d8ae.

Reverted https://github.com/pytorch/pytorch/pull/78714 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:30:48 +00:00
PyTorch MergeBot
4cbb1986fe Revert "[quant][ao_migration] torch.nn.qattorch.ao.nn.qat (#78716)"
This reverts commit 7cd2fa1d38.

Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted
2022-08-22 07:23:24 +00:00
zaf
7cd2fa1d38 [quant][ao_migration] torch.nn.qattorch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
    - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- None

Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
2022-08-22 05:33:23 +00:00
zaf
e6fb97d8ae [quant][ao_migration] torch.nn.quantized.dynamictorch.ao.nn.quantized.dynamic (#78714)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
- [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo
- [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a
- [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a

Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714
Approved by: https://github.com/jerryzh168
2022-08-22 05:22:00 +00:00
zaf
432f037498 [quant][ao_migration] torch.nn.quantized.modulestorch.ao.nn.quantized.modules (#78713)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
    - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
    - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
    - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
    - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
    - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
    - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
    - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
    - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
    - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
        - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
        - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- Documentation @vkuzo
  - docs/source/conf.py
  - docs/source/quantization.rst
- [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168
- [common test routine](test/quantization/ao_migration/common.py) @HDCharles
- JIT stuff @jamesr66a
  - torch/csrc/jit/passes/hoist_conv_packed_params.cpp
  - torch/csrc/jit/passes/quantization/helper.h
  - torch/csrc/jit/serialization/import_source.cpp

Differential Revision: [D36860145](https://our.internmc.facebook.com/intern/diff/D36860145/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713
Approved by: https://github.com/jerryzh168
2022-08-22 01:38:55 +00:00
zaf
78c8a0d752 [quant][ao_migration] torch.nn.quantized.functionaltorch.ao.nn.quantized.functional (#78712)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.

The list of the `nn.quantized` files that are being migrated:

- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
  - [X] [Current PR] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
  - [ ] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
  - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
  - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
  - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
  - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
  - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
  - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
  - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
    - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
    - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`

Majority of the files are just moved to the new location.
However, specific files need to be double checked:

- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10

Differential Revision: [D36792967](https://our.internmc.facebook.com/intern/diff/D36792967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36792967/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78712
Approved by: https://github.com/jerryzh168
2022-08-18 17:51:54 +00:00
Sergii Dymchenko
a7299647d3 Re-enable non-contig in qarithmetic_test.py (#82345)
As https://github.com/pytorch/pytorch/issues/29435 is closed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82345
Approved by: https://github.com/clee2000, https://github.com/huydhn
2022-07-29 00:10:50 +00:00
zaf
55d1b376ea [ao][sparsity] Vectorized WeightNormSparsifier (#80059)
The previous implementation was using loops to compute the sparsity within a block in a mask, as well as across the mask blocks. This implements the vectorized version.

## Vectorization:

A high level overview of the vectorization procedure falls into a two step process:

### Tensor-level masking

A tensor-level masking is a mask generation routine that has a granularity of `sparse_block_shape`. That means that only patches of that shape can be considered sparse/dense. To vectorize:

1. Reshape the data such that one of the dimensions represents the patches of sparse_block_shape.
2. Create a mask of the same shape as the reshaped data
3. Find the smallest `k` elements in the the data, given the dimension of the sparse "patches". `k` represents a derived paramter specifying the sparsity level.
4. Apply the 0/1 to the patches in the mask
5. Reshape the mask back to the original dimensions

Note: because the shape of the mask might not be multiple of the sparse_block_shape, we nudge the sshape of the mask, and truncate it afterwards.

## Block-level masking

A block-level masking is a mask generation routine that concerns itself only with sparsity within a patch of shape `sparse_block_shape`. This is useful when block sparsity allows partial block sparsification.

To vectorize:

Overall the block-level masking follows the same routine as the tensor-level algorithm described above. One distinction is that when reshaping the data/mask tensors we aim for creating a dimension that captures the internals of each patch. For example, if a `sparse_block_shape` is `(2, 2)`, we want to reshape the data/mask into `(2, 2, -1)`. That allows us to sort the internal elements on the last axis, and zero-out the ones that obey the sparse logic.

Differential Revision: [D37352494](https://our.internmc.facebook.com/intern/diff/D37352494/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37352494/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80059
Approved by: https://github.com/jerryzh168
2022-07-12 19:16:44 +00:00
Norman Ponte
2e8b9c7785 [TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035

- modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change

Reviewed By: tgalkovskyi

Differential Revision: D35257017

fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f
(cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)
2022-04-01 00:35:42 +00:00
Sergii Dymchenko
5b011fc6eb Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130
Approved by: https://github.com/malfet
2022-03-09 00:14:15 +00:00
Sergii Dymchenko
486572223b Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847

Reviewed By: malfet

Differential Revision: D34260868

Pulled By: kit1980

fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit c9e874c4d8)
2022-02-16 21:45:45 +00:00
Ben Koopman
c2c859bdf2 [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560

Test Plan: Imported from OSS

Reviewed By: HDCharles

Differential Revision: D31618282

Pulled By: b-koopman

fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3
2021-11-19 06:29:19 -08:00
Michael Suo
5c3529a86d [lint] small pass to make lint clean (#68367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367

- bmm_test.py was using syntax not allowed in 3.6
- Some suppressions were not placed on the correct line.

With this file,
```
lintrunner --paths-cmd='git grep -Il .'
```
passes successfully.

Test Plan: Imported from OSS

Reviewed By: janeyx99, mrshenli

Differential Revision: D32436644

Pulled By: suo

fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2
2021-11-16 10:27:00 -08:00
Shashank Chaudhry
89c4e8c22b [NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746

Test Plan: Visual inspection. Sandcastle.

Reviewed By: zertosh

Differential Revision: D31986646

fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8
2021-11-03 12:23:14 -07:00
Bin Wen
6900aacf54 [fbcode] Fix operator_benchmark with jit mode (#67382)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382

two simple updates:

* fix running benchmark with --use_jit. Previously will fail with error

  torch.jit.frontend.UnsupportedNodeError: import statements aren't supported:
  File "/proc/self/fd/3/bmm_test.py", line 9
  def __invoke_main():
    import ctypes
    ~~~~~~ <--- HERE
    import ctypes.util
    import errno

* add matmul to bmm benchmark as D31837588

Test Plan:
buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test --  --forward_only=True --mkl_num_threads=1 --omp_num_threads=1
 --use_jit=True

Reviewed By: ShijunK

Differential Revision: D31960528

fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f
2021-10-28 08:48:10 -07:00
Vasiliy Kuznetsov
d802877dfa speed up quantized interpolate for channels last (#66525)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525

This should solve https://github.com/pytorch/pytorch/issues/60015

There were two `q_zero_point()` accesses inside a for loop which was
expensive. Moving them to before the loop sped things up 10x for a
microbenchmark.

Test Plan:
```
// comment out benchmarks unrelated to original issue, for simplicity
cd benchmarks/operator_benchmark
python -m pt.qinterpolate_test

// before: 2994 us
// after: 324 us
// full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D31592422

fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459
2021-10-14 08:11:26 -07:00
Alexandr Guzhva
b8e1999253 [quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183

Add a GPU benchmark for fakeQuant, similar to #65241
ghstack-source-id: 139810414

Test Plan: https://pxl.cl/1QjJM

Reviewed By: b-koopman

Differential Revision: D31288158

fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d
2021-10-06 08:07:42 -07:00
Vasiliy Kuznetsov
e3af4be963 pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833

Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks`
folder.

```
find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g"
```

Test Plan: CI

Reviewed By: z-a-f

Differential Revision: D31275963

fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116
2021-10-01 06:17:36 -07:00
Philip Meier
aebde1bc2b deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D31141433

Pulled By: mruberry

fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732
2021-09-29 02:40:52 -07:00
Ben Koopman
6a6ee92e36 [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241

Test Plan: Imported from OSS

Reviewed By: jingsh

Differential Revision: D31150087

Pulled By: b-koopman

fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727
2021-09-27 16:01:49 -07:00
Eddie Ren
9c73a48ecf ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707

Use torch.randn instead of torch.from_numpy to generate the tensor

Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test

Reviewed By: jingsh

Differential Revision: D30817302

fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672
2021-09-13 06:47:35 -07:00
Eddie Ren
3fbb49e75d Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647

Add support for benchmarking of 8 bit quantizations of N-D batched embeddings. Currently only works for 3Dim embeddings and still requires thought on ramping up from 3Dim to NDim.

Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test```

Reviewed By: jingsh

Differential Revision: D30770085

fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e
2021-09-10 16:49:02 -07:00
Harut Movsisyan
956c8fa01e Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654

Test Plan:
```
> buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 27.970

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 41.830

# Benchmarking PyTorch: einsum_bmm
# Mode: Eager
# Name: einsum_bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 499.114

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B4_M5_N3_K2_cpu
# Input: B: 4, M: 5, N: 3, K: 2, device: cpu
Forward Execution Time (us) : 6.268

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B32_M25_N20_K30_cpu
# Input: B: 32, M: 25, N: 20, K: 30, device: cpu
Forward Execution Time (us) : 12.676

# Benchmarking PyTorch: bmm
# Mode: Eager
# Name: bmm_B128_M100_N120_K110_cpu
# Input: B: 128, M: 100, N: 120, K: 110, device: cpu
Forward Execution Time (us) : 438.219

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 7.657

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 18.523

# Benchmarking PyTorch: einsum_elementwise
# Mode: Eager
# Name: einsum_elementwise_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 55.103

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B4_M5_N3_cpu
# Input: B: 4, M: 5, N: 3, device: cpu
Forward Execution Time (us) : 2.501

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B32_M25_N20_cpu
# Input: B: 32, M: 25, N: 20, device: cpu
Forward Execution Time (us) : 10.589

# Benchmarking PyTorch: mul
# Mode: Eager
# Name: mul_B100_M90_N110_cpu
# Input: B: 100, M: 90, N: 110, device: cpu
Forward Execution Time (us) : 50.102

Reviewed By: ajyu

Differential Revision: D30455179

fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b
2021-08-24 16:26:26 -07:00
Supriya Rao
7a15576a65 [quant] update FakeQuant modules to use tensor qparams (#61318)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318

Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator.

Calling `float()/int()` internally calls `item()` which can trigger a gpu-> cpu copy if the original tensors reside on GPU.
Local benchmark P427668213

Before this change
```
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::_aminmax         2.57%       1.507ms         3.10%       1.819ms      36.371us       2.872ms         4.81%       2.872ms      57.446us            50
              aten::fake_quantize_per_tensor_affine         1.04%     610.915us         3.60%       2.114ms      42.276us     472.896us         0.79%       2.698ms      53.962us            50
    aten::fake_quantize_per_tensor_affine_cachemask         1.69%     993.626us         2.56%       1.503ms      30.058us       2.225ms         3.73%       2.225ms      44.504us            50
                                   aten::is_nonzero         3.85%       2.258ms        19.68%      11.540ms      46.161us       2.168ms         3.63%      11.084ms      44.336us           250
                                   aten::zeros_like         1.82%       1.064ms         6.65%       3.901ms      39.007us       1.531ms         2.57%       3.905ms      39.045us           100
                                           aten::eq        13.80%       8.093ms        25.90%      15.189ms      37.972us       9.580ms        16.05%      15.566ms      38.914us           400
                                         aten::item         5.67%       3.323ms        21.50%      12.607ms      36.019us       3.233ms         5.42%      12.167ms      34.762us           350
                                        aten::zeros         0.94%     549.208us         2.93%       1.717ms      34.343us     688.928us         1.15%       1.695ms      33.894us            50
                                           aten::le         2.52%       1.478ms         4.50%       2.641ms      26.411us       1.753ms         2.94%       2.845ms      28.448us           100
                                         aten::rsub         1.04%     608.715us         2.44%       1.433ms      28.667us     532.000us         0.89%       1.418ms      28.353us            50
                                          aten::max         1.54%     905.401us         4.62%       2.711ms      27.106us     847.488us         1.42%       2.697ms      26.969us           100
                                         aten::ones         0.92%     542.159us         2.16%       1.266ms      25.324us     661.856us         1.11%       1.301ms      26.017us            50
                                          aten::min         0.82%     479.167us         2.15%       1.258ms      25.160us     407.808us         0.68%       1.276ms      25.530us            50
                          aten::_local_scalar_dense        15.83%       9.284ms        15.83%       9.284ms      26.526us       8.934ms        14.97%       8.934ms      25.524us           350
                                        aten::clamp         2.35%       1.378ms         4.21%       2.467ms      24.669us       1.546ms         2.59%       2.461ms      24.612us           100
                                        aten::zero_         2.53%       1.482ms         5.65%       3.316ms      22.108us       1.326ms         2.22%       3.380ms      22.531us           150
                                      aten::maximum         3.08%       1.805ms         3.08%       1.805ms      18.052us       1.849ms         3.10%       1.849ms      18.494us           100
                                      aten::minimum         1.33%     778.854us         1.33%     778.854us      15.577us     868.672us         1.46%     868.672us      17.373us            50
                                        aten::round         1.36%     799.910us         1.36%     799.910us      15.998us     809.568us         1.36%     809.568us      16.191us            50
                                        aten::copy_         6.61%       3.878ms         6.61%       3.878ms      15.513us       4.036ms         6.76%       4.036ms      16.143us           250
                                          aten::div         2.53%       1.483ms         2.53%       1.483ms      14.833us       1.535ms         2.57%       1.535ms      15.353us           100
                                          aten::mul         2.44%       1.431ms         2.44%       1.431ms      14.314us       1.478ms         2.48%       1.478ms      14.782us           100
                                       aten::detach         1.46%     855.670us         2.41%       1.411ms      14.110us     832.448us         1.39%       1.395ms      13.949us           100
                                          aten::add         2.22%       1.301ms         2.22%       1.301ms      13.008us       1.383ms         2.32%       1.383ms      13.828us           100
                                        aten::fill_         4.18%       2.452ms         4.18%       2.452ms      12.262us       2.693ms         4.51%       2.693ms      13.463us           200
                                          aten::sub         5.06%       2.967ms         5.06%       2.967ms      14.837us       2.675ms         4.48%       2.675ms      13.374us           200
                                           aten::to         2.10%       1.230ms         3.65%       2.140ms      10.701us       1.310ms         2.20%       2.062ms      10.310us           200
                                       aten::select         1.28%     749.144us         1.49%     874.227us       8.742us     863.232us         1.45%     863.232us       8.632us           100
                                             detach         0.95%     555.326us         0.95%     555.326us       5.553us     562.496us         0.94%     562.496us       5.625us           100
                                   aten::as_strided         0.40%     232.289us         0.40%     232.289us       1.161us       0.000us         0.00%       0.000us       0.000us           200
                                        aten::empty         2.93%       1.720ms         2.93%       1.720ms       3.439us       0.000us         0.00%       0.000us       0.000us           500
                                      aten::resize_         1.04%     611.313us         1.04%     611.313us       2.038us       0.000us         0.00%       0.000us       0.000us           300
                                   aten::empty_like         0.75%     438.585us         1.77%       1.036ms       5.180us       0.000us         0.00%       0.000us       0.000us           200
                                aten::empty_strided         1.36%     799.442us         1.36%     799.442us       3.198us       0.000us         0.00%       0.000us       0.000us           250
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 58.645ms
Self CUDA time total: 59.674ms
```

After this change
```

test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                  aten::fake_quantize_per_tensor_affine         0.98%     505.210us         4.38%       2.259ms      45.187us     419.424us         0.78%       3.218ms      64.367us            50
                                         aten::_aminmax         2.78%       1.434ms         3.42%       1.766ms      35.321us       2.825ms         5.27%       2.825ms      56.505us            50
aten::fake_quantize_per_tensor_affine_cachemask_tens...         2.38%       1.229ms         3.40%       1.754ms      35.083us       2.799ms         5.22%       2.799ms      55.979us            50
                                             aten::rsub         0.94%     485.040us         5.02%       2.590ms      51.793us     458.976us         0.86%       2.587ms      51.747us            50
                                       aten::is_nonzero         3.78%       1.952ms        23.64%      12.196ms      48.786us       2.055ms         3.83%      11.986ms      47.944us           250
                                             aten::item         6.92%       3.572ms        19.86%      10.244ms      40.977us       3.670ms         6.85%       9.931ms      39.724us           250
                                       aten::zeros_like         1.65%     848.874us         6.64%       3.426ms      34.260us       1.397ms         2.61%       3.572ms      35.717us           100
                                            aten::zeros         0.85%     436.691us         3.00%       1.549ms      30.984us     551.936us         1.03%       1.576ms      31.516us            50
                                               aten::eq        10.60%       5.467ms        20.26%      10.452ms      26.130us       7.018ms        13.09%      10.832ms      27.079us           400
                                               aten::le         2.58%       1.332ms         4.67%       2.407ms      24.074us       1.580ms         2.95%       2.614ms      26.144us           100
                              aten::_local_scalar_dense        12.93%       6.673ms        12.93%       6.673ms      26.691us       6.261ms        11.68%       6.261ms      25.046us           250
                                            aten::clamp         2.43%       1.253ms         4.37%       2.256ms      22.560us       1.431ms         2.67%       2.273ms      22.725us           100
                                             aten::ones         0.89%     460.133us         2.18%       1.123ms      22.467us     570.496us         1.06%       1.128ms      22.551us            50
                                              aten::min         0.74%     383.132us         2.06%       1.065ms      21.296us     377.536us         0.70%       1.091ms      21.824us            50
                                            aten::zero_         2.36%       1.219ms         5.87%       3.029ms      20.194us       1.261ms         2.35%       3.199ms      21.327us           150
                                              aten::max         1.51%     779.081us         4.06%       2.096ms      20.960us     791.680us         1.48%       2.130ms      21.295us           100
                                              aten::sub         7.97%       4.111ms         7.97%       4.111ms      20.556us       3.847ms         7.18%       3.847ms      19.234us           200
                                              aten::div         2.94%       1.516ms         2.94%       1.516ms      15.158us       1.580ms         2.95%       1.580ms      15.798us           100
                                            aten::round         1.45%     750.445us         1.45%     750.445us      15.009us     756.064us         1.41%     756.064us      15.121us            50
                                            aten::copy_         6.88%       3.548ms         6.88%       3.548ms      14.190us       3.701ms         6.90%       3.701ms      14.803us           250
                                          aten::minimum         1.32%     681.654us         1.32%     681.654us      13.633us     713.664us         1.33%     713.664us      14.273us            50
                                          aten::maximum         2.55%       1.317ms         2.55%       1.317ms      13.169us       1.338ms         2.50%       1.338ms      13.378us           100
                                              aten::mul         2.63%       1.358ms         2.63%       1.358ms      13.581us       1.328ms         2.48%       1.328ms      13.283us           100
                                           aten::detach         1.34%     688.820us         2.35%       1.211ms      12.110us     772.800us         1.44%       1.278ms      12.779us           100
                                            aten::fill_         4.53%       2.338ms         4.53%       2.338ms      11.692us       2.495ms         4.65%       2.495ms      12.473us           200
                                              aten::add         2.32%       1.197ms         2.32%       1.197ms      11.968us       1.240ms         2.31%       1.240ms      12.405us           100
                                               aten::to         2.07%       1.069ms         3.66%       1.889ms       9.443us       1.224ms         2.28%       1.975ms       9.874us           200
                                           aten::select         1.44%     743.042us         1.64%     848.207us       8.482us     641.600us         1.20%     641.600us       6.416us           100
                                                 detach         1.01%     522.155us         1.01%     522.155us       5.222us     505.088us         0.94%     505.088us       5.051us           100
                                       aten::as_strided         0.44%     227.884us         0.44%     227.884us       1.139us       0.000us         0.00%       0.000us       0.000us           200
                                            aten::empty         3.20%       1.652ms         3.20%       1.652ms       3.304us       0.000us         0.00%       0.000us       0.000us           500
                                          aten::resize_         1.25%     646.711us         1.25%     646.711us       2.156us       0.000us         0.00%       0.000us       0.000us           300
                                       aten::empty_like         0.79%     407.768us         2.07%       1.067ms       5.334us       0.000us         0.00%       0.000us       0.000us           200
                                    aten::empty_strided         1.52%     785.788us         1.52%     785.788us       3.143us       0.000us         0.00%       0.000us       0.000us           250
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 51.590ms
Self CUDA time total: 53.609ms
ghstack-source-id: 133370215

Test Plan: buck test mode/dev-nosan caffe2/test/:quantization

Reviewed By: raghuramank100

Differential Revision: D29566512

fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c
2021-07-10 19:43:02 -07:00
Kimish Patel
3176f16691 [Pytorch benchmark] Add BMM benchmark (#59595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595

ghstack-source-id: 130946743

Test Plan: bmm_test

Reviewed By: mingzhe09088

Differential Revision: D28873228

fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c
2021-06-10 08:24:29 -07:00
Kimish Patel
8b63573c31 [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334

Add gelu op benchmark
ghstack-source-id: 130947172

Test Plan: gelu_test

Reviewed By: hl475

Differential Revision: D28842959

fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4
2021-06-09 16:09:37 -07:00
Rong Rong (AI Infra)
277f587496 rename benchmark_cpp_extension (#58708)
Summary:
Currently the cpp_extension build in benchmarks is misleading as it has the same name with torch.utils.cpp_extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708

Test Plan:
Run from `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```

Note: CI doesn't matter as currently benchmarks/ folder is not compiled/test against CI

Reviewed By: robieta

Differential Revision: D28585582

Pulled By: walterddr

fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57
2021-05-24 11:04:02 -07:00