mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-07 12:21:27 +01:00
3058700f7f
327 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
660e8060ad |
[BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang |
||
|
|
d59a6864fb |
Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit
|
||
|
|
88ab3e4322 |
[BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang |
||
|
|
9a1cdcb8a0 |
Format: fixing multiple string concatenation in single line (#106013)
Fixing multiple string concatenation in single line Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013 Approved by: https://github.com/albanD |
||
|
|
dd3a77bc96 |
Apply UFMT to all files in benchmarks/ (#105928)
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928 Approved by: https://github.com/albanD |
||
|
|
5ef023b05a |
[BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429 Approved by: https://github.com/malfet |
||
|
|
2f95a3d0fc |
[BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917 Approved by: https://github.com/ezyang, https://github.com/albanD |
||
|
|
e2a3817dfd |
[BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet |
||
|
|
8d45f555d7 |
[BE] [1/3] Rewrite super() calls in caffe2 and benchmarks (#94587)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: |
||
|
|
a229b4526f |
[BE] Prefer dash over underscore in command-line options (#94505)
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`:
|
||
|
|
8fce9a09cd |
[BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes). This PR only does two things: removes the need to inherit from object and removes unused future imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308 Approved by: https://github.com/ezyang, https://github.com/albanD |
||
|
|
c4544bc169 |
Fix thread-allocation in _vec_log_softmax_lastdim (#85398)
## Problem history There seems to always have been a bug in `_vec_log_softmax_lastdim `. In particular, there were two issues with it - #### Bug 1 Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`: `CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();` It was `256` for float32, bfloat16, and float16. When AVX512 support was added, `CHUNK_SIZE` became `512`. The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it. #### Bug 2 `grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)` So, `grain_size` was usually 0, as it was `8 / (dim_size)`, so, it's always replaced by `CHUNK_SIZE`, viz. 256. Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases. #### Problem caused by bugs With `outer_size` of say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`! When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_dim` was 700. In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`. AVX512 thus appeared to be quite slower, cloaking the actual issue that even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads. ## Solution Distribute work more efficiently, which would result in higher performance for both AVX2 & AVX512 than now, and fixes the regression observed with AVX512 (AVX512 kernel would now be faster than its AVX2 counterpart). ## Benchmarks ##### Machine-config: Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake) One socket of 26 physical cores was used. Intel OpenMP & tcmalloc were preloaded. Example of a command to run benchmark: `ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu` Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new) -- | -- | -- | -- LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x LogSoftmax_N1024_seq_len23258_dim1_cpu AVX512 | 18292.928 | 2586.550| 7.07x LogSoftmax_N700_seq_len23258_dim1_cpu AVX2 | 9611.902 | 1762.833 | 5.452x LogSoftmax_N700_seq_len23258_dim1_cpu AVX512 | 12168.371 | 1717.824 | 7.08x Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398 Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano |
||
|
|
323e0143d6 |
[Op Benchmark] Add Pointwise Conv2d Op Benchmark (#91918)
@bypass-github-export-checks Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```) with The configs are copied from Conv2d with the kernel parameter removed. I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even if though that would not be the case. Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918 Approved by: https://github.com/mcr229 |
||
|
|
30edd39bdc |
Fix non-existing parameters in docstrings in benchmarks (#91115)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115 Approved by: https://github.com/clee2000 |
||
|
|
14d5f139d2 |
Fix typos under benchmarks, test, and tools directories (#87975)
This PR fixes typos in `.md` files under benchmarks, test, and tools directories Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975 Approved by: https://github.com/kit1980 |
||
|
|
97de281176 |
Improve interpolate() speed for channels_last CPU images and masks (#86361)
This PR improves the speed of `interpolate()`: - on CPU - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! ### Speedup benchmarks: Results: <details> ``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ``` </details> Code: <details> I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py ```py import operator_benchmark as op_bench import torch """Microbenchmarks for interpolate operator.""" class InterpolateBenchmark(op_bench.TorchBenchmarkBase): def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float): input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set()) if channels_last: if input_image.ndim == 4: input_image = input_image.contiguous(memory_format=torch.channels_last) elif input_image.ndim == 5: input_image = input_image.contiguous(memory_format=torch.channels_last_3d) else: raise ValueError( f"Can not set channels_last to the input of {input_image.ndim} dims" ) align_corners = None if "nearest" in mode else False if mode == "linear": mode = { 3: 'linear', 4: 'bilinear', 5: 'trilinear', }[input_image.ndim] self.inputs = { "input_image": input_image, "output_size": output_size, "mode": mode, "align_corners": align_corners, } self.set_module_name("interpolate") def forward(self, input_image, output_size, mode, align_corners): return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners) def make_config(): sizes = ( ((224, 224), (64, 64)), ((224, 224), (128, 128)), ((600, 400), (224, 224)), ((320, 320), (256, 256)), ((800, 800), (500, 500)), ) attrs = [] for (HW1, HW2) in sizes: attrs.append([(1, 3, *HW1), HW2]) # 3 channels attrs.append([(1, 1, *HW1), HW2]) # 1 channel attrs.append([(1, 3, *HW2), HW1]) # 3 channels attrs.append([(1, 1, *HW2), HW1]) # 1 channel config = op_bench.config_list( attr_names=["input_size", "output_size"], attrs=attrs, cross_product_configs={ 'channels_last': [True], 'mode': ["linear", "nearest", "nearest-exact"], 'dtype': [torch.float, torch.uint8] }, tags=["short"], ) # Need to remove instances with both torch.int and linear # Note: this is naaaasty def get_mode(l): for d in l: if "mode" in d: return d["mode"] def get_dtype(l): for d in l: if "dtype" in d: return d["dtype"] config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)] return config config = make_config() op_bench.generate_pt_test(config, InterpolateBenchmark) if __name__ == "__main__": op_bench.benchmark_runner.main() ``` with ``` for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file ``` and this very ugly helper ```py import re with open("main") as f: main = f.readlines() with open("new") as f: new = f.readlines() out = [] for main_line, new_line in zip(main, new): if main_line.startswith("num_threads="): num_threads = int(main_line.split("=")[-1]) if main_line.startswith("# Input"): deets = f"{main_line.strip()}, {num_threads=}" if main_line.startswith("Forward"): main_time = float(main_line.split()[-1]) new_time = float(new_line.split()[-1]) ratio = main_time / new_time fmt = ".1f" if ratio < 3 else ".0f" improv = f"{ratio:{fmt}}X" time_fmt = ",.3f" if new_time < 100 else ",.1f" deets = deets.strip().replace("# Input: ", "") deets = deets.replace(": ", "=") deets = deets.replace("input_size=", "") deets = deets.replace(", output_size=", " -> ") deets = deets.replace("dtype=torch.", "") deets = deets.replace("mode=", "") deets = deets.replace("channels_last=True, ", "") split = deets.split(",") size = ','.join(split[:-3]) mode, dtype, threads = split[-3:] deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}" l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms" out.append(l) def key(s): # s = ''.join(s.split()[1:]) # remove "N.nX" part num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),) input_shape, output_shape = re.findall("\(.*?\)", s) input_shape = input_shape[1:-1] # remove parenthesis input_HW = tuple(int(x) for x in input_shape.split(",")[-2:]) input_C = (-int(input_shape.split(",")[1]),) output_HW = tuple(int(x) for x in output_shape[1:-1].split(",")) is_downsample = (output_HW[0] < input_HW[0],) if "linear" in s: mode = "linear" elif "nearest-exact" in s: mode = "nearest-exact" else: assert "nearest" in s mode = "nearest" mode = (mode,) return is_downsample + input_HW + output_HW + num_threads + input_C + mode for i, l in enumerate(sorted(out, key=key)): if i % 10 == 0 and i % 40 != 0: print() if i % 40 == 0: print("-" * 100) print(l) ``` </details> Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa |
||
|
|
3a2cfbb813 |
Revert "Improve interpolate() speed for channels_last images and masks (#86361)"
This reverts commit
|
||
|
|
93b2d99158 |
Improve interpolate() speed for channels_last images and masks (#86361)
This PR improves the speed of `interpolate()`: - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! ### Speedup benchmarks: Results: <details> ``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ``` </details> Code: <details> I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py ```py import operator_benchmark as op_bench import torch """Microbenchmarks for interpolate operator.""" class InterpolateBenchmark(op_bench.TorchBenchmarkBase): def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float): input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set()) if channels_last: if input_image.ndim == 4: input_image = input_image.contiguous(memory_format=torch.channels_last) elif input_image.ndim == 5: input_image = input_image.contiguous(memory_format=torch.channels_last_3d) else: raise ValueError( f"Can not set channels_last to the input of {input_image.ndim} dims" ) align_corners = None if "nearest" in mode else False if mode == "linear": mode = { 3: 'linear', 4: 'bilinear', 5: 'trilinear', }[input_image.ndim] self.inputs = { "input_image": input_image, "output_size": output_size, "mode": mode, "align_corners": align_corners, } self.set_module_name("interpolate") def forward(self, input_image, output_size, mode, align_corners): return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners) def make_config(): sizes = ( ((224, 224), (64, 64)), ((224, 224), (128, 128)), ((600, 400), (224, 224)), ((320, 320), (256, 256)), ((800, 800), (500, 500)), ) attrs = [] for (HW1, HW2) in sizes: attrs.append([(1, 3, *HW1), HW2]) # 3 channels attrs.append([(1, 1, *HW1), HW2]) # 1 channel attrs.append([(1, 3, *HW2), HW1]) # 3 channels attrs.append([(1, 1, *HW2), HW1]) # 1 channel config = op_bench.config_list( attr_names=["input_size", "output_size"], attrs=attrs, cross_product_configs={ 'channels_last': [True], 'mode': ["linear", "nearest", "nearest-exact"], 'dtype': [torch.float, torch.uint8] }, tags=["short"], ) # Need to remove instances with both torch.int and linear # Note: this is naaaasty def get_mode(l): for d in l: if "mode" in d: return d["mode"] def get_dtype(l): for d in l: if "dtype" in d: return d["dtype"] config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)] return config config = make_config() op_bench.generate_pt_test(config, InterpolateBenchmark) if __name__ == "__main__": op_bench.benchmark_runner.main() ``` with ``` for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file ``` and this very ugly helper ```py import re with open("main") as f: main = f.readlines() with open("new") as f: new = f.readlines() out = [] for main_line, new_line in zip(main, new): if main_line.startswith("num_threads="): num_threads = int(main_line.split("=")[-1]) if main_line.startswith("# Input"): deets = f"{main_line.strip()}, {num_threads=}" if main_line.startswith("Forward"): main_time = float(main_line.split()[-1]) new_time = float(new_line.split()[-1]) ratio = main_time / new_time fmt = ".1f" if ratio < 3 else ".0f" improv = f"{ratio:{fmt}}X" time_fmt = ",.3f" if new_time < 100 else ",.1f" deets = deets.strip().replace("# Input: ", "") deets = deets.replace(": ", "=") deets = deets.replace("input_size=", "") deets = deets.replace(", output_size=", " -> ") deets = deets.replace("dtype=torch.", "") deets = deets.replace("mode=", "") deets = deets.replace("channels_last=True, ", "") split = deets.split(",") size = ','.join(split[:-3]) mode, dtype, threads = split[-3:] deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}" l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms" out.append(l) def key(s): # s = ''.join(s.split()[1:]) # remove "N.nX" part num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),) input_shape, output_shape = re.findall("\(.*?\)", s) input_shape = input_shape[1:-1] # remove parenthesis input_HW = tuple(int(x) for x in input_shape.split(",")[-2:]) input_C = (-int(input_shape.split(",")[1]),) output_HW = tuple(int(x) for x in output_shape[1:-1].split(",")) is_downsample = (output_HW[0] < input_HW[0],) if "linear" in s: mode = "linear" elif "nearest-exact" in s: mode = "nearest-exact" else: assert "nearest" in s mode = "nearest" mode = (mode,) return is_downsample + input_HW + output_HW + num_threads + input_C + mode for i, l in enumerate(sorted(out, key=key)): if i % 10 == 0 and i % 40 != 0: print() if i % 40 == 0: print("-" * 100) print(l) ``` </details> Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa |
||
|
|
0e30da3f2f |
[refactor] Renaming ao.sparsity to ao.pruning (#84867)
`Sparsity` as a term doesn't reflect the tools that are developed by the AO. The `torch/ao/sparsity` also has utilities for structured pruning, which internally we always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing the backwards compatibility, as so far this toolset was kept under silent development. This change will reflect the changes in the documentation as well. **TODO:** - [ ] Change the tutorials - [ ] Confirm no bc-breakages - [ ] Reflect the changes in the trackers and RFC docs Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867 Approved by: https://github.com/supriyar |
||
|
|
2f04ba2c7c |
[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
- [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- None
Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
|
||
|
|
d32a762147 |
[quant][ao_migration] torch.nn.quantized.dynamic → torch.ao.nn.quantized.dynamic (#78714)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
- [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
- [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo
- [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a
- [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a
Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)!
Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714
Approved by: https://github.com/jerryzh168
|
||
|
|
c92e5ac95b |
[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
- [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- Documentation @vkuzo
- docs/source/conf.py
- docs/source/quantization.rst
- [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168
- [common test routine](test/quantization/ao_migration/common.py) @HDCharles
- JIT stuff @jamesr66a
- torch/csrc/jit/passes/hoist_conv_packed_params.cpp
- torch/csrc/jit/passes/quantization/helper.h
- torch/csrc/jit/serialization/import_source.cpp
Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/)
Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713
Approved by: https://github.com/jerryzh168
|
||
|
|
6a9c02339d |
Revert "[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)"
This reverts commit
|
||
|
|
b1a7b67529 |
Revert "[quant][ao_migration] torch.nn.quantized.dynamic → torch.ao.nn.quantized.dynamic (#78714)"
This reverts commit
|
||
|
|
4cbb1986fe |
Revert "[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)"
This reverts commit
|
||
|
|
7cd2fa1d38 |
[quant][ao_migration] torch.nn.qat → torch.ao.nn.qat (#78716)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [X] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat`
- [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- None
Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716
Approved by: https://github.com/jerryzh168
|
||
|
|
e6fb97d8ae |
[quant][ao_migration] torch.nn.quantized.dynamic → torch.ao.nn.quantized.dynamic (#78714)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
- [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
- [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo
- [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a
- [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a
Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714
Approved by: https://github.com/jerryzh168
|
||
|
|
432f037498 |
[quant][ao_migration] torch.nn.quantized.modules → torch.ao.nn.quantized.modules (#78713)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
- [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- Documentation @vkuzo
- docs/source/conf.py
- docs/source/quantization.rst
- [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168
- [common test routine](test/quantization/ao_migration/common.py) @HDCharles
- JIT stuff @jamesr66a
- torch/csrc/jit/passes/hoist_conv_packed_params.cpp
- torch/csrc/jit/passes/quantization/helper.h
- torch/csrc/jit/serialization/import_source.cpp
Differential Revision: [D36860145](https://our.internmc.facebook.com/intern/diff/D36860145/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713
Approved by: https://github.com/jerryzh168
|
||
|
|
78c8a0d752 |
[quant][ao_migration] torch.nn.quantized.functional → torch.ao.nn.quantized.functional (#78712)
Context: In order to avoid the cluttering of the `torch.nn` namespace
the quantized modules namespace is moved to `torch.ao.nn`.
The list of the `nn.quantized` files that are being migrated:
- [ ] `torch.nn.quantized` → `torch.ao.nn.quantized`
- [X] [Current PR] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional`
- [ ] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules`
- [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic`
- [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference`
- [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable`
- [ ] `torch.nn.qat` → `torch.ao.nn.qat`
- [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules`
- [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic`
- [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic`
- [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules`
- [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat`
- [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized`
- [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules`
- [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic`
Majority of the files are just moved to the new location.
However, specific files need to be double checked:
- [Documentation](docs/source/quantization-support.rst) @vkuzo
- [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10
Differential Revision: [D36792967](https://our.internmc.facebook.com/intern/diff/D36792967/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36792967/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78712
Approved by: https://github.com/jerryzh168
|
||
|
|
a7299647d3 |
Re-enable non-contig in qarithmetic_test.py (#82345)
As https://github.com/pytorch/pytorch/issues/29435 is closed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82345 Approved by: https://github.com/clee2000, https://github.com/huydhn |
||
|
|
55d1b376ea |
[ao][sparsity] Vectorized WeightNormSparsifier (#80059)
The previous implementation was using loops to compute the sparsity within a block in a mask, as well as across the mask blocks. This implements the vectorized version. ## Vectorization: A high level overview of the vectorization procedure falls into a two step process: ### Tensor-level masking A tensor-level masking is a mask generation routine that has a granularity of `sparse_block_shape`. That means that only patches of that shape can be considered sparse/dense. To vectorize: 1. Reshape the data such that one of the dimensions represents the patches of sparse_block_shape. 2. Create a mask of the same shape as the reshaped data 3. Find the smallest `k` elements in the the data, given the dimension of the sparse "patches". `k` represents a derived paramter specifying the sparsity level. 4. Apply the 0/1 to the patches in the mask 5. Reshape the mask back to the original dimensions Note: because the shape of the mask might not be multiple of the sparse_block_shape, we nudge the sshape of the mask, and truncate it afterwards. ## Block-level masking A block-level masking is a mask generation routine that concerns itself only with sparsity within a patch of shape `sparse_block_shape`. This is useful when block sparsity allows partial block sparsification. To vectorize: Overall the block-level masking follows the same routine as the tensor-level algorithm described above. One distinction is that when reshaping the data/mask tensors we aim for creating a dimension that captures the internals of each patch. For example, if a `sparse_block_shape` is `(2, 2)`, we want to reshape the data/mask into `(2, 2, -1)`. That allows us to sort the internal elements on the last axis, and zero-out the ones that obey the sparse logic. Differential Revision: [D37352494](https://our.internmc.facebook.com/intern/diff/D37352494/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37352494/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/80059 Approved by: https://github.com/jerryzh168 |
||
|
|
2e8b9c7785 |
[TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035 - modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change Reviewed By: tgalkovskyi Differential Revision: D35257017 fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f (cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275) |
||
|
|
5b011fc6eb |
Fix Undefined variable in QInterpolateBenchmark
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130 Approved by: https://github.com/malfet |
||
|
|
486572223b |
Fix command example (#72847)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847
Reviewed By: malfet
Differential Revision: D34260868
Pulled By: kit1980
fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7
(cherry picked from commit
|
||
|
|
c2c859bdf2 |
[quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560 Test Plan: Imported from OSS Reviewed By: HDCharles Differential Revision: D31618282 Pulled By: b-koopman fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3 |
||
|
|
5c3529a86d |
[lint] small pass to make lint clean (#68367)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367 - bmm_test.py was using syntax not allowed in 3.6 - Some suppressions were not placed on the correct line. With this file, ``` lintrunner --paths-cmd='git grep -Il .' ``` passes successfully. Test Plan: Imported from OSS Reviewed By: janeyx99, mrshenli Differential Revision: D32436644 Pulled By: suo fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2 |
||
|
|
89c4e8c22b |
[NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746 Test Plan: Visual inspection. Sandcastle. Reviewed By: zertosh Differential Revision: D31986646 fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8 |
||
|
|
6900aacf54 |
[fbcode] Fix operator_benchmark with jit mode (#67382)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382 two simple updates: * fix running benchmark with --use_jit. Previously will fail with error torch.jit.frontend.UnsupportedNodeError: import statements aren't supported: File "/proc/self/fd/3/bmm_test.py", line 9 def __invoke_main(): import ctypes ~~~~~~ <--- HERE import ctypes.util import errno * add matmul to bmm benchmark as D31837588 Test Plan: buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True Reviewed By: ShijunK Differential Revision: D31960528 fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f |
||
|
|
d802877dfa |
speed up quantized interpolate for channels last (#66525)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525 This should solve https://github.com/pytorch/pytorch/issues/60015 There were two `q_zero_point()` accesses inside a for loop which was expensive. Moving them to before the loop sped things up 10x for a microbenchmark. Test Plan: ``` // comment out benchmarks unrelated to original issue, for simplicity cd benchmarks/operator_benchmark python -m pt.qinterpolate_test // before: 2994 us // after: 324 us // full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D31592422 fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459 |
||
|
|
b8e1999253 |
[quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183 Add a GPU benchmark for fakeQuant, similar to #65241 ghstack-source-id: 139810414 Test Plan: https://pxl.cl/1QjJM Reviewed By: b-koopman Differential Revision: D31288158 fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d |
||
|
|
e3af4be963 |
pytorch quantization ao migration phase 2: caffe2/benchmark (#65833)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833 Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks` folder. ``` find caffe2/benchmarks/ -type f -name "*.py" -print0 | xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g" ``` Test Plan: CI Reviewed By: z-a-f Differential Revision: D31275963 fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116 |
||
|
|
aebde1bc2b |
deprecate device getter from torch.testing namespace (#63844)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D31141433 Pulled By: mruberry fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732 |
||
|
|
6a6ee92e36 |
[quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241 Test Plan: Imported from OSS Reviewed By: jingsh Differential Revision: D31150087 Pulled By: b-koopman fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727 |
||
|
|
9c73a48ecf |
ND Embeddings benchmark - Standardize randomized inputs (#64707)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707 Use torch.randn instead of torch.from_numpy to generate the tensor Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test Reviewed By: jingsh Differential Revision: D30817302 fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672 |
||
|
|
3fbb49e75d |
Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647 Add support for benchmarking of 8 bit quantizations of N-D batched embeddings. Currently only works for 3Dim embeddings and still requires thought on ramping up from 3Dim to NDim. Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test``` Reviewed By: jingsh Differential Revision: D30770085 fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e |
||
|
|
956c8fa01e |
Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654 Test Plan: ``` > buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 27.970 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 41.830 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 499.114 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 6.268 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 12.676 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 438.219 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 7.657 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 18.523 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 55.103 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 2.501 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 10.589 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 50.102 Reviewed By: ajyu Differential Revision: D30455179 fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b |
||
|
|
7a15576a65 |
[quant] update FakeQuant modules to use tensor qparams (#61318)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318 Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator. Calling `float()/int()` internally calls `item()` which can trigger a gpu-> cpu copy if the original tensors reside on GPU. Local benchmark P427668213 Before this change ``` Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::_aminmax 2.57% 1.507ms 3.10% 1.819ms 36.371us 2.872ms 4.81% 2.872ms 57.446us 50 aten::fake_quantize_per_tensor_affine 1.04% 610.915us 3.60% 2.114ms 42.276us 472.896us 0.79% 2.698ms 53.962us 50 aten::fake_quantize_per_tensor_affine_cachemask 1.69% 993.626us 2.56% 1.503ms 30.058us 2.225ms 3.73% 2.225ms 44.504us 50 aten::is_nonzero 3.85% 2.258ms 19.68% 11.540ms 46.161us 2.168ms 3.63% 11.084ms 44.336us 250 aten::zeros_like 1.82% 1.064ms 6.65% 3.901ms 39.007us 1.531ms 2.57% 3.905ms 39.045us 100 aten::eq 13.80% 8.093ms 25.90% 15.189ms 37.972us 9.580ms 16.05% 15.566ms 38.914us 400 aten::item 5.67% 3.323ms 21.50% 12.607ms 36.019us 3.233ms 5.42% 12.167ms 34.762us 350 aten::zeros 0.94% 549.208us 2.93% 1.717ms 34.343us 688.928us 1.15% 1.695ms 33.894us 50 aten::le 2.52% 1.478ms 4.50% 2.641ms 26.411us 1.753ms 2.94% 2.845ms 28.448us 100 aten::rsub 1.04% 608.715us 2.44% 1.433ms 28.667us 532.000us 0.89% 1.418ms 28.353us 50 aten::max 1.54% 905.401us 4.62% 2.711ms 27.106us 847.488us 1.42% 2.697ms 26.969us 100 aten::ones 0.92% 542.159us 2.16% 1.266ms 25.324us 661.856us 1.11% 1.301ms 26.017us 50 aten::min 0.82% 479.167us 2.15% 1.258ms 25.160us 407.808us 0.68% 1.276ms 25.530us 50 aten::_local_scalar_dense 15.83% 9.284ms 15.83% 9.284ms 26.526us 8.934ms 14.97% 8.934ms 25.524us 350 aten::clamp 2.35% 1.378ms 4.21% 2.467ms 24.669us 1.546ms 2.59% 2.461ms 24.612us 100 aten::zero_ 2.53% 1.482ms 5.65% 3.316ms 22.108us 1.326ms 2.22% 3.380ms 22.531us 150 aten::maximum 3.08% 1.805ms 3.08% 1.805ms 18.052us 1.849ms 3.10% 1.849ms 18.494us 100 aten::minimum 1.33% 778.854us 1.33% 778.854us 15.577us 868.672us 1.46% 868.672us 17.373us 50 aten::round 1.36% 799.910us 1.36% 799.910us 15.998us 809.568us 1.36% 809.568us 16.191us 50 aten::copy_ 6.61% 3.878ms 6.61% 3.878ms 15.513us 4.036ms 6.76% 4.036ms 16.143us 250 aten::div 2.53% 1.483ms 2.53% 1.483ms 14.833us 1.535ms 2.57% 1.535ms 15.353us 100 aten::mul 2.44% 1.431ms 2.44% 1.431ms 14.314us 1.478ms 2.48% 1.478ms 14.782us 100 aten::detach 1.46% 855.670us 2.41% 1.411ms 14.110us 832.448us 1.39% 1.395ms 13.949us 100 aten::add 2.22% 1.301ms 2.22% 1.301ms 13.008us 1.383ms 2.32% 1.383ms 13.828us 100 aten::fill_ 4.18% 2.452ms 4.18% 2.452ms 12.262us 2.693ms 4.51% 2.693ms 13.463us 200 aten::sub 5.06% 2.967ms 5.06% 2.967ms 14.837us 2.675ms 4.48% 2.675ms 13.374us 200 aten::to 2.10% 1.230ms 3.65% 2.140ms 10.701us 1.310ms 2.20% 2.062ms 10.310us 200 aten::select 1.28% 749.144us 1.49% 874.227us 8.742us 863.232us 1.45% 863.232us 8.632us 100 detach 0.95% 555.326us 0.95% 555.326us 5.553us 562.496us 0.94% 562.496us 5.625us 100 aten::as_strided 0.40% 232.289us 0.40% 232.289us 1.161us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 2.93% 1.720ms 2.93% 1.720ms 3.439us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.04% 611.313us 1.04% 611.313us 2.038us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.75% 438.585us 1.77% 1.036ms 5.180us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.36% 799.442us 1.36% 799.442us 3.198us 0.000us 0.00% 0.000us 0.000us 250 --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 58.645ms Self CUDA time total: 59.674ms ``` After this change ``` test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::fake_quantize_per_tensor_affine 0.98% 505.210us 4.38% 2.259ms 45.187us 419.424us 0.78% 3.218ms 64.367us 50 aten::_aminmax 2.78% 1.434ms 3.42% 1.766ms 35.321us 2.825ms 5.27% 2.825ms 56.505us 50 aten::fake_quantize_per_tensor_affine_cachemask_tens... 2.38% 1.229ms 3.40% 1.754ms 35.083us 2.799ms 5.22% 2.799ms 55.979us 50 aten::rsub 0.94% 485.040us 5.02% 2.590ms 51.793us 458.976us 0.86% 2.587ms 51.747us 50 aten::is_nonzero 3.78% 1.952ms 23.64% 12.196ms 48.786us 2.055ms 3.83% 11.986ms 47.944us 250 aten::item 6.92% 3.572ms 19.86% 10.244ms 40.977us 3.670ms 6.85% 9.931ms 39.724us 250 aten::zeros_like 1.65% 848.874us 6.64% 3.426ms 34.260us 1.397ms 2.61% 3.572ms 35.717us 100 aten::zeros 0.85% 436.691us 3.00% 1.549ms 30.984us 551.936us 1.03% 1.576ms 31.516us 50 aten::eq 10.60% 5.467ms 20.26% 10.452ms 26.130us 7.018ms 13.09% 10.832ms 27.079us 400 aten::le 2.58% 1.332ms 4.67% 2.407ms 24.074us 1.580ms 2.95% 2.614ms 26.144us 100 aten::_local_scalar_dense 12.93% 6.673ms 12.93% 6.673ms 26.691us 6.261ms 11.68% 6.261ms 25.046us 250 aten::clamp 2.43% 1.253ms 4.37% 2.256ms 22.560us 1.431ms 2.67% 2.273ms 22.725us 100 aten::ones 0.89% 460.133us 2.18% 1.123ms 22.467us 570.496us 1.06% 1.128ms 22.551us 50 aten::min 0.74% 383.132us 2.06% 1.065ms 21.296us 377.536us 0.70% 1.091ms 21.824us 50 aten::zero_ 2.36% 1.219ms 5.87% 3.029ms 20.194us 1.261ms 2.35% 3.199ms 21.327us 150 aten::max 1.51% 779.081us 4.06% 2.096ms 20.960us 791.680us 1.48% 2.130ms 21.295us 100 aten::sub 7.97% 4.111ms 7.97% 4.111ms 20.556us 3.847ms 7.18% 3.847ms 19.234us 200 aten::div 2.94% 1.516ms 2.94% 1.516ms 15.158us 1.580ms 2.95% 1.580ms 15.798us 100 aten::round 1.45% 750.445us 1.45% 750.445us 15.009us 756.064us 1.41% 756.064us 15.121us 50 aten::copy_ 6.88% 3.548ms 6.88% 3.548ms 14.190us 3.701ms 6.90% 3.701ms 14.803us 250 aten::minimum 1.32% 681.654us 1.32% 681.654us 13.633us 713.664us 1.33% 713.664us 14.273us 50 aten::maximum 2.55% 1.317ms 2.55% 1.317ms 13.169us 1.338ms 2.50% 1.338ms 13.378us 100 aten::mul 2.63% 1.358ms 2.63% 1.358ms 13.581us 1.328ms 2.48% 1.328ms 13.283us 100 aten::detach 1.34% 688.820us 2.35% 1.211ms 12.110us 772.800us 1.44% 1.278ms 12.779us 100 aten::fill_ 4.53% 2.338ms 4.53% 2.338ms 11.692us 2.495ms 4.65% 2.495ms 12.473us 200 aten::add 2.32% 1.197ms 2.32% 1.197ms 11.968us 1.240ms 2.31% 1.240ms 12.405us 100 aten::to 2.07% 1.069ms 3.66% 1.889ms 9.443us 1.224ms 2.28% 1.975ms 9.874us 200 aten::select 1.44% 743.042us 1.64% 848.207us 8.482us 641.600us 1.20% 641.600us 6.416us 100 detach 1.01% 522.155us 1.01% 522.155us 5.222us 505.088us 0.94% 505.088us 5.051us 100 aten::as_strided 0.44% 227.884us 0.44% 227.884us 1.139us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 3.20% 1.652ms 3.20% 1.652ms 3.304us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.25% 646.711us 1.25% 646.711us 2.156us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.79% 407.768us 2.07% 1.067ms 5.334us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.52% 785.788us 1.52% 785.788us 3.143us 0.000us 0.00% 0.000us 0.000us 250 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 51.590ms Self CUDA time total: 53.609ms ghstack-source-id: 133370215 Test Plan: buck test mode/dev-nosan caffe2/test/:quantization Reviewed By: raghuramank100 Differential Revision: D29566512 fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c |
||
|
|
3176f16691 |
[Pytorch benchmark] Add BMM benchmark (#59595)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595 ghstack-source-id: 130946743 Test Plan: bmm_test Reviewed By: mingzhe09088 Differential Revision: D28873228 fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c |
||
|
|
8b63573c31 |
[PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334 Add gelu op benchmark ghstack-source-id: 130947172 Test Plan: gelu_test Reviewed By: hl475 Differential Revision: D28842959 fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4 |
||
|
|
277f587496 |
rename benchmark_cpp_extension (#58708)
Summary: Currently the cpp_extension build in benchmarks is misleading as it has the same name with torch.utils.cpp_extension Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708 Test Plan: Run from `./benchmarks/operator_benchmark/pt_extension` folder: ``` python setup.py install python cpp_extension_test.py ``` Note: CI doesn't matter as currently benchmarks/ folder is not compiled/test against CI Reviewed By: robieta Differential Revision: D28585582 Pulled By: walterddr fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57 |