pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Aaron Gokaslan	660e8060ad	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-22 23:16:38 +00:00
PyTorch MergeBot	d59a6864fb	Revert "[BE]: Update ruff to 0.285 (#107519 )" This reverts commit `88ab3e4322`. Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))	2023-08-22 19:53:32 +00:00
Aaron Gokaslan	88ab3e4322	[BE]: Update ruff to 0.285 (#107519 ) This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings. I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519 Approved by: https://github.com/ezyang	2023-08-20 01:36:18 +00:00
FFFrog	9a1cdcb8a0	Format: fixing multiple string concatenation in single line (#106013 ) Fixing multiple string concatenation in single line Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013 Approved by: https://github.com/albanD	2023-07-26 18:39:18 +00:00
Edward Z. Yang	dd3a77bc96	Apply UFMT to all files in benchmarks/ (#105928 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928 Approved by: https://github.com/albanD	2023-07-26 01:18:48 +00:00
Justin Chu	5ef023b05a	[BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429 Approved by: https://github.com/malfet	2023-07-19 04:46:37 +00:00
Aaron Gokaslan	2f95a3d0fc	[BE]: Apply ruff PERF fixes to torch (#104917 ) Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-07-11 20:45:21 +00:00
Aaron Gokaslan	e2a3817dfd	[BE] Enable C419 rule for any all shortcircuiting (#99890 ) Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890 Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet	2023-04-25 15:02:13 +00:00
Xuehai Pan	8d45f555d7	[BE] [1/3] Rewrite `super()` calls in caffe2 and benchmarks (#94587 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587 Approved by: https://github.com/ezyang	2023-02-11 18:19:48 +00:00
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505 ) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library: `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also typing `_` need to press the shift or the caps-lock key than `-`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
Aaron Gokaslan	8fce9a09cd	[BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308 ) Apply parts of pyupgrade to torch (starting with the safest changes). This PR only does two things: removes the need to inherit from object and removes unused future imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-07 21:10:56 +00:00
sanchitintel	c4544bc169	Fix thread-allocation in `_vec_log_softmax_lastdim` (#85398 ) ## Problem history There seems to always have been a bug in `_vec_log_softmax_lastdim `. In particular, there were two issues with it - #### Bug 1 Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`: `CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();` It was `256` for float32, bfloat16, and float16. When AVX512 support was added, `CHUNK_SIZE` became `512`. The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it. #### Bug 2 `grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)` So, `grain_size` was usually 0, as it was `8 / (dim_size)`, so, it's always replaced by `CHUNK_SIZE`, viz. 256. Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases. #### Problem caused by bugs With `outer_size` of say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`! When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_dim` was 700. In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`. AVX512 thus appeared to be quite slower, cloaking the actual issue that even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads. ## Solution Distribute work more efficiently, which would result in higher performance for both AVX2 & AVX512 than now, and fixes the regression observed with AVX512 (AVX512 kernel would now be faster than its AVX2 counterpart). ## Benchmarks ##### Machine-config: Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake) One socket of 26 physical cores was used. Intel OpenMP & tcmalloc were preloaded. Example of a command to run benchmark: `ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu` Benchmark \| Old implementation time (us) \| New implementation time (us) \| Speedup ratio (old/new) -- \| -- \| -- \| -- LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 \| 11069.281 \| 2651.186 \| 4.17x LogSoftmax_N1024_seq_len23258_dim1_cpu AVX512 \| 18292.928 \| 2586.550\| 7.07x LogSoftmax_N700_seq_len23258_dim1_cpu AVX2 \| 9611.902 \| 1762.833 \| 5.452x LogSoftmax_N700_seq_len23258_dim1_cpu AVX512 \| 12168.371 \| 1717.824 \| 7.08x Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398 Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano	2023-02-07 15:09:05 +00:00
salilsdesai	323e0143d6	[Op Benchmark] Add Pointwise Conv2d Op Benchmark (#91918 ) @bypass-github-export-checks Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```) with The configs are copied from Conv2d with the kernel parameter removed. I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even if though that would not be the case. Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918 Approved by: https://github.com/mcr229	2023-01-10 21:36:37 +00:00
Sergii Dymchenko	30edd39bdc	Fix non-existing parameters in docstrings in benchmarks (#91115 ) This is a continuation of https://github.com/pytorch/pytorch/pull/90505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115 Approved by: https://github.com/clee2000	2022-12-20 02:07:32 +00:00
Kazuaki Ishizaki	14d5f139d2	Fix typos under benchmarks, test, and tools directories (#87975 ) This PR fixes typos in `.md` files under benchmarks, test, and tools directories Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975 Approved by: https://github.com/kit1980	2022-10-29 01:26:17 +00:00
Nicolas Hug	97de281176	Improve interpolate() speed for channels_last CPU images and masks (#86361 ) This PR improves the speed of `interpolate()`: - on CPU - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! ### Speedup benchmarks: Results: <details> ``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ``` </details> Code: <details> I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py ```py import operator_benchmark as op_bench import torch """Microbenchmarks for interpolate operator.""" class InterpolateBenchmark(op_bench.TorchBenchmarkBase): def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float): input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set()) if channels_last: if input_image.ndim == 4: input_image = input_image.contiguous(memory_format=torch.channels_last) elif input_image.ndim == 5: input_image = input_image.contiguous(memory_format=torch.channels_last_3d) else: raise ValueError( f"Can not set channels_last to the input of {input_image.ndim} dims" ) align_corners = None if "nearest" in mode else False if mode == "linear": mode = { 3: 'linear', 4: 'bilinear', 5: 'trilinear', }[input_image.ndim] self.inputs = { "input_image": input_image, "output_size": output_size, "mode": mode, "align_corners": align_corners, } self.set_module_name("interpolate") def forward(self, input_image, output_size, mode, align_corners): return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners) def make_config(): sizes = ( ((224, 224), (64, 64)), ((224, 224), (128, 128)), ((600, 400), (224, 224)), ((320, 320), (256, 256)), ((800, 800), (500, 500)), ) attrs = [] for (HW1, HW2) in sizes: attrs.append([(1, 3, HW1), HW2]) # 3 channels attrs.append([(1, 1, HW1), HW2]) # 1 channel attrs.append([(1, 3, HW2), HW1]) # 3 channels attrs.append([(1, 1, HW2), HW1]) # 1 channel config = op_bench.config_list( attr_names=["input_size", "output_size"], attrs=attrs, cross_product_configs={ 'channels_last': [True], 'mode': ["linear", "nearest", "nearest-exact"], 'dtype': [torch.float, torch.uint8] }, tags=["short"], ) # Need to remove instances with both torch.int and linear # Note: this is naaaasty def get_mode(l): for d in l: if "mode" in d: return d["mode"] def get_dtype(l): for d in l: if "dtype" in d: return d["dtype"] config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)] return config config = make_config() op_bench.generate_pt_test(config, InterpolateBenchmark) if __name__ == "__main__": op_bench.benchmark_runner.main() ``` with ``` for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file ``` and this very ugly helper ```py import re with open("main") as f: main = f.readlines() with open("new") as f: new = f.readlines() out = [] for main_line, new_line in zip(main, new): if main_line.startswith("num_threads="): num_threads = int(main_line.split("=")[-1]) if main_line.startswith("# Input"): deets = f"{main_line.strip()}, {num_threads=}" if main_line.startswith("Forward"): main_time = float(main_line.split()[-1]) new_time = float(new_line.split()[-1]) ratio = main_time / new_time fmt = ".1f" if ratio < 3 else ".0f" improv = f"{ratio:{fmt}}X" time_fmt = ",.3f" if new_time < 100 else ",.1f" deets = deets.strip().replace("# Input: ", "") deets = deets.replace(": ", "=") deets = deets.replace("input_size=", "") deets = deets.replace(", output_size=", " -> ") deets = deets.replace("dtype=torch.", "") deets = deets.replace("mode=", "") deets = deets.replace("channels_last=True, ", "") split = deets.split(",") size = ','.join(split[:-3]) mode, dtype, threads = split[-3:] deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}" l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms" out.append(l) def key(s): # s = ''.join(s.split()[1:]) # remove "N.nX" part num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),) input_shape, output_shape = re.findall("$.?$", s) input_shape = input_shape[1:-1] # remove parenthesis input_HW = tuple(int(x) for x in input_shape.split(",")[-2:]) input_C = (-int(input_shape.split(",")[1]),) output_HW = tuple(int(x) for x in output_shape[1:-1].split(",")) is_downsample = (output_HW[0] < input_HW[0],) if "linear" in s: mode = "linear" elif "nearest-exact" in s: mode = "nearest-exact" else: assert "nearest" in s mode = "nearest" mode = (mode,) return is_downsample + input_HW + output_HW + num_threads + input_C + mode for i, l in enumerate(sorted(out, key=key)): if i % 10 == 0 and i % 40 != 0: print() if i % 40 == 0: print("-" 100) print(l) ``` </details> Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa	2022-10-11 16:17:36 +00:00
PyTorch MergeBot	3a2cfbb813	Revert "Improve interpolate() speed for channels_last images and masks (#86361 )" This reverts commit `93b2d99158`. Reverted https://github.com/pytorch/pytorch/pull/86361 on behalf of https://github.com/DanilBaibak due to Break the internal import process	2022-10-11 10:17:27 +00:00
Nicolas Hug	93b2d99158	Improve interpolate() speed for channels_last images and masks (#86361 ) This PR improves the speed of `interpolate()`: - on images and masks (`num_channels < 4`, `channels_last=True`) - for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float) - for both upsampling and downsampling The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details): ``` (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms ``` An immediate follow-up to this PR would be to do the same changes for the 3D kernels. Thanks a ton @fmassa for the help! ### Speedup benchmarks: Results: <details> ``` ---------------------------------------------------------------------------------------------------- (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=1 1.6X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 1.7X 1.0ms vs 0.5ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=1 8X 0.806ms vs 0.097ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.056ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=1 10X 0.828ms vs 0.084ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.914ms vs 0.057ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.900ms vs 0.086ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=2 1.6X 1.1ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=2 1.6X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 1.7X 0.6ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 1.7X 0.5ms vs 0.3ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=2 9X 0.800ms vs 0.088ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=2 11X 0.459ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=2 7X 0.424ms vs 0.064ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.503ms vs 0.043ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.461ms vs 0.059ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=12 3X 1.1ms vs 0.3ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=12 1.6X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=12 5X 0.8ms vs 0.2ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=12 10X 0.445ms vs 0.047ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=12 7X 0.432ms vs 0.062ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=12 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=12 7X 0.470ms vs 0.063ms (1, 3, 64, 64) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.4ms (1, 3, 64, 64) -> (224, 224) nearest float32 num_threads=32 1.8X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 1, 64, 64) -> (224, 224) linear float32 num_threads=32 11X 0.815ms vs 0.074ms (1, 1, 64, 64) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.045ms (1, 1, 64, 64) -> (224, 224) nearest uint8 num_threads=32 7X 0.436ms vs 0.061ms (1, 1, 64, 64) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 64, 64) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.061ms ---------------------------------------------------------------------------------------------------- (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=1 0.9X 0.9ms vs 1.1ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=1 1.5X 0.9ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 1.6X 1.0ms vs 0.6ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=1 8X 0.808ms vs 0.099ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=1 15X 0.848ms vs 0.058ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=1 9X 0.820ms vs 0.087ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=1 16X 0.909ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.898ms vs 0.088ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=2 1.4X 0.9ms vs 0.7ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=2 1.5X 0.5ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 1.5X 0.5ms vs 0.4ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=2 9X 0.799ms vs 0.090ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=2 10X 0.459ms vs 0.045ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=2 7X 0.427ms vs 0.059ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=2 11X 0.501ms vs 0.044ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=2 8X 0.460ms vs 0.060ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=12 2.9X 1.0ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=12 1.2X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 1.1X 0.2ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=12 12X 0.809ms vs 0.068ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=12 8X 0.432ms vs 0.055ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.480ms vs 0.041ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.464ms vs 0.056ms (1, 3, 128, 128) -> (224, 224) linear float32 num_threads=32 3X 1.1ms vs 0.3ms (1, 3, 128, 128) -> (224, 224) nearest float32 num_threads=32 1.3X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 1.4X 0.3ms vs 0.2ms (1, 3, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 128, 128) -> (224, 224) linear float32 num_threads=32 11X 0.813ms vs 0.075ms (1, 1, 128, 128) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest uint8 num_threads=32 7X 0.433ms vs 0.061ms (1, 1, 128, 128) -> (224, 224) nearest-exact float32 num_threads=32 10X 0.478ms vs 0.046ms (1, 1, 128, 128) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.062ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=1 0.9X 4.5ms vs 5.2ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=1 1.5X 4.2ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=1 1.8X 4.1ms vs 2.3ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 1.6X 4.5ms vs 2.8ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 1.9X 4.4ms vs 2.3ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=1 9X 3.8ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=1 17X 4.0ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=1 11X 3.9ms vs 0.4ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=1 19X 4.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=1 12X 4.3ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=2 1.5X 4.5ms vs 3.1ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=2 1.4X 2.3ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=2 1.7X 2.1ms vs 1.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 1.6X 2.5ms vs 1.6ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 1.8X 2.2ms vs 1.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=2 15X 3.8ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=2 15X 2.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=2 7X 2.0ms vs 0.3ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=2 16X 2.4ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=2 8X 2.2ms vs 0.3ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=12 8X 5.2ms vs 0.7ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=12 1.3X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 1.4X 0.6ms vs 0.4ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=12 36X 3.9ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=12 10X 0.526ms vs 0.051ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=12 7X 0.514ms vs 0.069ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=12 11X 0.569ms vs 0.052ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=12 8X 0.557ms vs 0.070ms (1, 3, 224, 224) -> (600, 400) linear float32 num_threads=32 9X 4.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest float32 num_threads=32 0.5X 0.2ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 1.0X 0.5ms vs 0.5ms (1, 3, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (600, 400) linear float32 num_threads=32 44X 3.864ms vs 0.087ms (1, 1, 224, 224) -> (600, 400) nearest float32 num_threads=32 10X 0.527ms vs 0.053ms (1, 1, 224, 224) -> (600, 400) nearest uint8 num_threads=32 7X 0.516ms vs 0.070ms (1, 1, 224, 224) -> (600, 400) nearest-exact float32 num_threads=32 10X 0.567ms vs 0.055ms (1, 1, 224, 224) -> (600, 400) nearest-exact uint8 num_threads=32 8X 0.558ms vs 0.072ms ---------------------------------------------------------------------------------------------------- (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=1 1.0X 1.9ms vs 1.9ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=1 2.0X 1.8ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=1 1.7X 1.8ms vs 1.0ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 2.1X 1.9ms vs 0.9ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 1.9X 1.9ms vs 1.0ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=1 9X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=1 16X 1.7ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=1 10X 1.7ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=1 17X 1.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=1 11X 1.8ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=2 1.7X 1.9ms vs 1.1ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=2 2.0X 1.0ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=2 1.7X 0.9ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 2.3X 1.1ms vs 0.5ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 1.8X 1.0ms vs 0.5ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=2 8X 1.6ms vs 0.2ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=2 14X 0.931ms vs 0.067ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=2 7X 0.9ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=2 15X 1.016ms vs 0.069ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=2 9X 0.9ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=12 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=12 20X 1.630ms vs 0.081ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=12 10X 0.457ms vs 0.044ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=12 7X 0.439ms vs 0.060ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=12 11X 0.485ms vs 0.045ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=12 8X 0.474ms vs 0.061ms (1, 3, 256, 256) -> (320, 320) linear float32 num_threads=32 8X 1.9ms vs 0.3ms (1, 3, 256, 256) -> (320, 320) nearest float32 num_threads=32 2.0X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 1.4X 0.2ms vs 0.2ms (1, 3, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 256, 256) -> (320, 320) linear float32 num_threads=32 21X 1.628ms vs 0.078ms (1, 1, 256, 256) -> (320, 320) nearest float32 num_threads=32 9X 0.453ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest uint8 num_threads=32 7X 0.445ms vs 0.063ms (1, 1, 256, 256) -> (320, 320) nearest-exact float32 num_threads=32 11X 0.535ms vs 0.048ms (1, 1, 256, 256) -> (320, 320) nearest-exact uint8 num_threads=32 8X 0.502ms vs 0.063ms ---------------------------------------------------------------------------------------------------- (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=1 1.0X 13.8ms vs 14.0ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=1 1.8X 13.1ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=1 1.8X 11.1ms vs 6.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 1.9X 13.9ms vs 7.4ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 1.9X 11.8ms vs 6.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=1 10X 10.2ms vs 1.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=1 19X 10.8ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=1 11X 10.4ms vs 0.9ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=1 20X 11.6ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=1 12X 11.4ms vs 0.9ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=2 1.8X 13.7ms vs 7.7ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=2 2.6X 7.3ms vs 2.8ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=2 1.8X 5.6ms vs 3.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 1.9X 7.9ms vs 4.1ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 1.9X 6.0ms vs 3.1ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=2 18X 10.1ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=2 19X 5.8ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=2 10X 5.3ms vs 0.5ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=2 20X 6.3ms vs 0.3ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=2 11X 5.7ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=12 8X 13.8ms vs 1.6ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=12 2.9X 1.5ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=12 1.7X 1.0ms vs 0.5ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 1.5X 1.5ms vs 1.0ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 1.8X 1.0ms vs 0.6ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=12 80X 10.1ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=12 13X 0.928ms vs 0.072ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=12 8X 0.9ms vs 0.1ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=12 13X 1.001ms vs 0.074ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=12 9X 1.0ms vs 0.1ms (1, 3, 500, 500) -> (800, 800) linear float32 num_threads=32 18X 14.0ms vs 0.8ms (1, 3, 500, 500) -> (800, 800) nearest float32 num_threads=32 1.9X 1.0ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest uint8 num_threads=32 2.9X 0.7ms vs 0.2ms (1, 3, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 1.7X 0.9ms vs 0.6ms (1, 3, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 1.8X 0.4ms vs 0.2ms (1, 1, 500, 500) -> (800, 800) linear float32 num_threads=32 111X 10.254ms vs 0.092ms (1, 1, 500, 500) -> (800, 800) nearest float32 num_threads=32 14X 0.784ms vs 0.056ms (1, 1, 500, 500) -> (800, 800) nearest uint8 num_threads=32 7X 0.551ms vs 0.075ms (1, 1, 500, 500) -> (800, 800) nearest-exact float32 num_threads=32 11X 0.607ms vs 0.057ms (1, 1, 500, 500) -> (800, 800) nearest-exact uint8 num_threads=32 8X 0.596ms vs 0.076ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.077ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=1 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=1 1.0X 0.074ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=1 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=1 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=1 0.9X 0.078ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.076ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.075ms vs 0.074ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.082ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.080ms vs 0.083ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=2 1.0X 0.070ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=2 1.0X 0.073ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=2 1.0X 0.071ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=2 1.0X 0.079ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=2 1.0X 0.077ms vs 0.079ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.083ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.077ms vs 0.075ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.083ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=12 1.0X 0.071ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=12 1.0X 0.076ms vs 0.074ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=12 1.0X 0.073ms vs 0.071ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=12 1.0X 0.080ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=12 1.0X 0.080ms vs 0.078ms (1, 3, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.084ms vs 0.084ms (1, 3, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.078ms vs 0.077ms (1, 3, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.076ms vs 0.076ms (1, 3, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.083ms vs 0.083ms (1, 3, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.081ms vs 0.082ms (1, 1, 224, 224) -> (64, 64) linear float32 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest float32 num_threads=32 1.0X 0.074ms vs 0.075ms (1, 1, 224, 224) -> (64, 64) nearest uint8 num_threads=32 1.0X 0.072ms vs 0.072ms (1, 1, 224, 224) -> (64, 64) nearest-exact float32 num_threads=32 1.0X 0.077ms vs 0.080ms (1, 1, 224, 224) -> (64, 64) nearest-exact uint8 num_threads=32 1.0X 0.076ms vs 0.079ms ---------------------------------------------------------------------------------------------------- (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=1 1.8X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=1 1.6X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 2.0X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 1.7X 0.3ms vs 0.2ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=1 6X 0.265ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=1 10X 0.280ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=1 7X 0.273ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=1 11X 0.303ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=1 8X 0.297ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=2 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=2 1.8X 0.163ms vs 0.093ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=2 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 1.9X 0.180ms vs 0.096ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=2 6X 0.264ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=2 10X 0.278ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=2 7X 0.270ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=2 11X 0.298ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=2 8X 0.293ms vs 0.037ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=12 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=12 1.7X 0.158ms vs 0.095ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 1.7X 0.170ms vs 0.100ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=12 6X 0.269ms vs 0.043ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=12 11X 0.291ms vs 0.027ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=12 8X 0.281ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=12 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=12 8X 0.306ms vs 0.038ms (1, 3, 224, 224) -> (128, 128) linear float32 num_threads=32 1.5X 0.3ms vs 0.2ms (1, 3, 224, 224) -> (128, 128) nearest float32 num_threads=32 1.6X 0.160ms vs 0.098ms (1, 3, 224, 224) -> (128, 128) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 1.7X 0.171ms vs 0.099ms (1, 3, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 224, 224) -> (128, 128) linear float32 num_threads=32 6X 0.269ms vs 0.044ms (1, 1, 224, 224) -> (128, 128) nearest float32 num_threads=32 10X 0.282ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest uint8 num_threads=32 7X 0.276ms vs 0.037ms (1, 1, 224, 224) -> (128, 128) nearest-exact float32 num_threads=32 11X 0.305ms vs 0.028ms (1, 1, 224, 224) -> (128, 128) nearest-exact uint8 num_threads=32 8X 0.299ms vs 0.038ms ---------------------------------------------------------------------------------------------------- (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=1 1.0X 1.2ms vs 1.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=1 2.0X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=1 1.7X 1.1ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 2.1X 1.2ms vs 0.6ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 1.9X 1.2ms vs 0.7ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=1 8X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=1 15X 1.109ms vs 0.073ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=1 10X 1.1ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=1 16X 1.192ms vs 0.074ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=1 11X 1.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=2 1.7X 1.2ms vs 0.7ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=2 2.0X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=2 1.7X 0.6ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 2.2X 0.7ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 1.8X 0.6ms vs 0.3ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=2 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=2 11X 0.598ms vs 0.052ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=2 8X 0.556ms vs 0.072ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=2 12X 0.649ms vs 0.053ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=2 8X 0.598ms vs 0.073ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=12 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=12 1.3X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=12 9X 1.0ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=12 12X 0.572ms vs 0.048ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=12 8X 0.560ms vs 0.068ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=12 13X 0.617ms vs 0.049ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=12 9X 0.604ms vs 0.068ms (1, 3, 320, 320) -> (256, 256) linear float32 num_threads=32 5X 1.2ms vs 0.3ms (1, 3, 320, 320) -> (256, 256) nearest float32 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 3, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 1.4X 0.2ms vs 0.1ms (1, 1, 320, 320) -> (256, 256) linear float32 num_threads=32 13X 1.042ms vs 0.081ms (1, 1, 320, 320) -> (256, 256) nearest float32 num_threads=32 12X 0.586ms vs 0.050ms (1, 1, 320, 320) -> (256, 256) nearest uint8 num_threads=32 8X 0.562ms vs 0.069ms (1, 1, 320, 320) -> (256, 256) nearest-exact float32 num_threads=32 12X 0.621ms vs 0.051ms (1, 1, 320, 320) -> (256, 256) nearest-exact uint8 num_threads=32 9X 0.609ms vs 0.070ms ---------------------------------------------------------------------------------------------------- (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=1 1.0X 1.0ms vs 1.0ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=1 1.9X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=1 1.7X 0.9ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 2.1X 1.0ms vs 0.5ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 1.8X 0.9ms vs 0.5ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=1 7X 0.8ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=1 14X 0.852ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=1 9X 0.828ms vs 0.087ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=1 15X 0.922ms vs 0.061ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=1 10X 0.897ms vs 0.087ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=2 1.9X 0.5ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=2 1.7X 0.4ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 2.1X 0.5ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 1.8X 0.5ms vs 0.3ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=2 10X 0.808ms vs 0.084ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=2 10X 0.462ms vs 0.046ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=2 7X 0.429ms vs 0.062ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=2 12X 0.504ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=2 7X 0.461ms vs 0.063ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=12 4X 1.0ms vs 0.2ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=12 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=12 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 1.9X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=12 12X 0.820ms vs 0.067ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=12 11X 0.438ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=12 8X 0.431ms vs 0.056ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=12 12X 0.482ms vs 0.041ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=12 8X 0.467ms vs 0.056ms (1, 3, 600, 400) -> (224, 224) linear float32 num_threads=32 4X 1.0ms vs 0.3ms (1, 3, 600, 400) -> (224, 224) nearest float32 num_threads=32 1.7X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest uint8 num_threads=32 1.5X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 600, 400) -> (224, 224) linear float32 num_threads=32 12X 0.824ms vs 0.070ms (1, 1, 600, 400) -> (224, 224) nearest float32 num_threads=32 10X 0.443ms vs 0.044ms (1, 1, 600, 400) -> (224, 224) nearest uint8 num_threads=32 7X 0.438ms vs 0.059ms (1, 1, 600, 400) -> (224, 224) nearest-exact float32 num_threads=32 11X 0.479ms vs 0.045ms (1, 1, 600, 400) -> (224, 224) nearest-exact uint8 num_threads=32 8X 0.470ms vs 0.059ms ---------------------------------------------------------------------------------------------------- (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=1 1.0X 4.7ms vs 4.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=1 2.0X 4.4ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=1 1.8X 4.3ms vs 2.5ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 2.1X 4.7ms vs 2.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 1.9X 4.6ms vs 2.5ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=1 9X 4.0ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=1 17X 4.2ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=1 11X 4.1ms vs 0.4ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=1 19X 4.6ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=1 12X 4.5ms vs 0.4ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=2 1.7X 4.7ms vs 2.7ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=2 2.1X 2.4ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=2 1.8X 2.2ms vs 1.3ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 2.3X 2.6ms vs 1.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 1.9X 2.3ms vs 1.3ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=2 15X 4.0ms vs 0.3ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=2 16X 2.3ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=2 9X 2.1ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=2 17X 2.5ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=2 10X 2.3ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=12 10X 4.7ms vs 0.5ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=12 1.7X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 1.9X 0.4ms vs 0.2ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 1.8X 0.4ms vs 0.2ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=12 41X 3.969ms vs 0.096ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=12 11X 0.545ms vs 0.051ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=12 8X 0.532ms vs 0.070ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=12 11X 0.590ms vs 0.052ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=12 8X 0.578ms vs 0.071ms (1, 3, 800, 800) -> (500, 500) linear float32 num_threads=32 17X 4.7ms vs 0.3ms (1, 3, 800, 800) -> (500, 500) nearest float32 num_threads=32 1.8X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest uint8 num_threads=32 2.0X 0.3ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 1.9X 0.2ms vs 0.1ms (1, 3, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 1.6X 0.2ms vs 0.1ms (1, 1, 800, 800) -> (500, 500) linear float32 num_threads=32 45X 4.028ms vs 0.090ms (1, 1, 800, 800) -> (500, 500) nearest float32 num_threads=32 10X 0.549ms vs 0.053ms (1, 1, 800, 800) -> (500, 500) nearest uint8 num_threads=32 7X 0.536ms vs 0.072ms (1, 1, 800, 800) -> (500, 500) nearest-exact float32 num_threads=32 11X 0.592ms vs 0.055ms (1, 1, 800, 800) -> (500, 500) nearest-exact uint8 num_threads=32 8X 0.581ms vs 0.074ms ``` </details> Code: <details> I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py ```py import operator_benchmark as op_bench import torch """Microbenchmarks for interpolate operator.""" class InterpolateBenchmark(op_bench.TorchBenchmarkBase): def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float): input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu', requires_grad=self.auto_set()) if channels_last: if input_image.ndim == 4: input_image = input_image.contiguous(memory_format=torch.channels_last) elif input_image.ndim == 5: input_image = input_image.contiguous(memory_format=torch.channels_last_3d) else: raise ValueError( f"Can not set channels_last to the input of {input_image.ndim} dims" ) align_corners = None if "nearest" in mode else False if mode == "linear": mode = { 3: 'linear', 4: 'bilinear', 5: 'trilinear', }[input_image.ndim] self.inputs = { "input_image": input_image, "output_size": output_size, "mode": mode, "align_corners": align_corners, } self.set_module_name("interpolate") def forward(self, input_image, output_size, mode, align_corners): return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=align_corners) def make_config(): sizes = ( ((224, 224), (64, 64)), ((224, 224), (128, 128)), ((600, 400), (224, 224)), ((320, 320), (256, 256)), ((800, 800), (500, 500)), ) attrs = [] for (HW1, HW2) in sizes: attrs.append([(1, 3, HW1), HW2]) # 3 channels attrs.append([(1, 1, HW1), HW2]) # 1 channel attrs.append([(1, 3, HW2), HW1]) # 3 channels attrs.append([(1, 1, HW2), HW1]) # 1 channel config = op_bench.config_list( attr_names=["input_size", "output_size"], attrs=attrs, cross_product_configs={ 'channels_last': [True], 'mode': ["linear", "nearest", "nearest-exact"], 'dtype': [torch.float, torch.uint8] }, tags=["short"], ) # Need to remove instances with both torch.int and linear # Note: this is naaaasty def get_mode(l): for d in l: if "mode" in d: return d["mode"] def get_dtype(l): for d in l: if "dtype" in d: return d["dtype"] config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)] return config config = make_config() op_bench.generate_pt_test(config, InterpolateBenchmark) if __name__ == "__main__": op_bench.benchmark_runner.main() ``` with ``` for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file ``` and this very ugly helper ```py import re with open("main") as f: main = f.readlines() with open("new") as f: new = f.readlines() out = [] for main_line, new_line in zip(main, new): if main_line.startswith("num_threads="): num_threads = int(main_line.split("=")[-1]) if main_line.startswith("# Input"): deets = f"{main_line.strip()}, {num_threads=}" if main_line.startswith("Forward"): main_time = float(main_line.split()[-1]) new_time = float(new_line.split()[-1]) ratio = main_time / new_time fmt = ".1f" if ratio < 3 else ".0f" improv = f"{ratio:{fmt}}X" time_fmt = ",.3f" if new_time < 100 else ",.1f" deets = deets.strip().replace("# Input: ", "") deets = deets.replace(": ", "=") deets = deets.replace("input_size=", "") deets = deets.replace(", output_size=", " -> ") deets = deets.replace("dtype=torch.", "") deets = deets.replace("mode=", "") deets = deets.replace("channels_last=True, ", "") split = deets.split(",") size = ','.join(split[:-3]) mode, dtype, threads = split[-3:] deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}" l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms" out.append(l) def key(s): # s = ''.join(s.split()[1:]) # remove "N.nX" part num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),) input_shape, output_shape = re.findall("$.?$", s) input_shape = input_shape[1:-1] # remove parenthesis input_HW = tuple(int(x) for x in input_shape.split(",")[-2:]) input_C = (-int(input_shape.split(",")[1]),) output_HW = tuple(int(x) for x in output_shape[1:-1].split(",")) is_downsample = (output_HW[0] < input_HW[0],) if "linear" in s: mode = "linear" elif "nearest-exact" in s: mode = "nearest-exact" else: assert "nearest" in s mode = "nearest" mode = (mode,) return is_downsample + input_HW + output_HW + num_threads + input_C + mode for i, l in enumerate(sorted(out, key=key)): if i % 10 == 0 and i % 40 != 0: print() if i % 40 == 0: print("-" 100) print(l) ``` </details> Closes https://github.com/pytorch/pytorch/issues/83840 When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361 Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa	2022-10-07 07:52:36 +00:00
Zafar	0e30da3f2f	[refactor] Renaming ao.sparsity to ao.pruning (#84867 ) `Sparsity` as a term doesn't reflect the tools that are developed by the AO. The `torch/ao/sparsity` also has utilities for structured pruning, which internally we always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing the backwards compatibility, as so far this toolset was kept under silent development. This change will reflect the changes in the documentation as well. TODO: - [ ] Change the tutorials - [ ] Confirm no bc-breakages - [ ] Reflect the changes in the trackers and RFC docs Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867 Approved by: https://github.com/supriyar	2022-10-07 00:58:41 +00:00
zaf	2f04ba2c7c	[quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168	2022-08-25 16:50:38 +00:00
zaf	d32a762147	[quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168	2022-08-25 16:50:34 +00:00
zaf	c92e5ac95b	[quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012/) Differential Revision: [D38926012](https://our.internmc.facebook.com/intern/diff/D38926012) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168	2022-08-25 16:50:33 +00:00
PyTorch MergeBot	6a9c02339d	Revert "[quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713 )" This reverts commit `432f037498`. Reverted https://github.com/pytorch/pytorch/pull/78713 on behalf of https://github.com/janeyx99 due to Reverting for breaking (trunk-only) ios build	2022-08-22 07:32:37 +00:00
PyTorch MergeBot	b1a7b67529	Revert "[quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714 )" This reverts commit `e6fb97d8ae`. Reverted https://github.com/pytorch/pytorch/pull/78714 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted	2022-08-22 07:30:48 +00:00
PyTorch MergeBot	4cbb1986fe	Revert "[quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716 )" This reverts commit `7cd2fa1d38`. Reverted https://github.com/pytorch/pytorch/pull/78716 on behalf of https://github.com/janeyx99 due to sorry, reverting so https://github.com/pytorch/pytorch/pull/78713 could be cleanly reverted	2022-08-22 07:23:24 +00:00
zaf	7cd2fa1d38	[quant][ao_migration] `torch.nn.qat` → `torch.ao.nn.qat` (#78716 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [X] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [X] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [X] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [X] [Current PR] `torch.nn.qat` → `torch.ao.nn.qat` - [X] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [X] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - None Differential Revision: [D36861197](https://our.internmc.facebook.com/intern/diff/D36861197/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36861197/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78716 Approved by: https://github.com/jerryzh168	2022-08-22 05:33:23 +00:00
zaf	e6fb97d8ae	[quant][ao_migration] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` (#78714 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [X] [Current PR] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 - [BC test](test/quantization/bc/test_backward_compatibility.py) @vkuzo - [IR emitter](torch/csrc/jit/frontend/ir_emitter.cpp) @jamesr66a - [JIT serialization](torch/csrc/jit/serialization/import_source.cpp) @IvanKobzarev @jamesr66a Differential Revision: [D36860660](https://our.internmc.facebook.com/intern/diff/D36860660/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36860660/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78714 Approved by: https://github.com/jerryzh168	2022-08-22 05:22:00 +00:00
zaf	432f037498	[quant][ao_migration] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` (#78713 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [X] [Current PR] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - Documentation @vkuzo - docs/source/conf.py - docs/source/quantization.rst - [quantize_fx](torch/ao/quantization/quantize_fx.py) @jerryzh168 - [common test routine](test/quantization/ao_migration/common.py) @HDCharles - JIT stuff @jamesr66a - torch/csrc/jit/passes/hoist_conv_packed_params.cpp - torch/csrc/jit/passes/quantization/helper.h - torch/csrc/jit/serialization/import_source.cpp Differential Revision: [D36860145](https://our.internmc.facebook.com/intern/diff/D36860145/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/78713 Approved by: https://github.com/jerryzh168	2022-08-22 01:38:55 +00:00
zaf	78c8a0d752	[quant][ao_migration] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` (#78712 ) Context: In order to avoid the cluttering of the `torch.nn` namespace the quantized modules namespace is moved to `torch.ao.nn`. The list of the `nn.quantized` files that are being migrated: - [ ] `torch.nn.quantized` → `torch.ao.nn.quantized` - [X] [Current PR] `torch.nn.quantized.functional` → `torch.ao.nn.quantized.functional` - [ ] `torch.nn.quantized.modules` → `torch.ao.nn.quantized.modules` - [ ] `torch.nn.quantized.dynamic` → `torch.ao.nn.quantized.dynamic` - [ ] `torch.nn.quantized._reference` → `torch.ao.nn.quantized._reference` - [ ] `torch.nn.quantizable` → `torch.ao.nn.quantizable` - [ ] `torch.nn.qat` → `torch.ao.nn.qat` - [ ] `torch.nn.qat.modules` → `torch.ao.nn.qat.modules` - [ ] `torch.nn.qat.dynamic` → `torch.ao.nn.qat.dynamic` - [ ] `torch.nn.intrinsic` → `torch.ao.nn.intrinsic` - [ ] `torch.nn.intrinsic.modules` → `torch.ao.nn.intrinsic.modules` - [ ] `torch.nn.intrinsic.qat` → `torch.ao.nn.intrinsic.qat` - [ ] `torch.nn.intrinsic.quantized` → `torch.ao.nn.intrinsic.quantized` - [ ] `torch.nn.intrinsic.quantized.modules` → `torch.ao.nn.intrinsic.quantized.modules` - [ ] `torch.nn.intrinsic.quantized.dynamic` → `torch.ao.nn.intrinsic.quantized.dynamic` Majority of the files are just moved to the new location. However, specific files need to be double checked: - [Documentation](docs/source/quantization-support.rst) @vkuzo - [Public API test list](test/allowlist_for_publicAPI.json) @peterbell10 Differential Revision: [D36792967](https://our.internmc.facebook.com/intern/diff/D36792967/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36792967/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/78712 Approved by: https://github.com/jerryzh168	2022-08-18 17:51:54 +00:00
Sergii Dymchenko	a7299647d3	Re-enable non-contig in qarithmetic_test.py (#82345 ) As https://github.com/pytorch/pytorch/issues/29435 is closed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/82345 Approved by: https://github.com/clee2000, https://github.com/huydhn	2022-07-29 00:10:50 +00:00
zaf	55d1b376ea	[ao][sparsity] Vectorized WeightNormSparsifier (#80059 ) The previous implementation was using loops to compute the sparsity within a block in a mask, as well as across the mask blocks. This implements the vectorized version. ## Vectorization: A high level overview of the vectorization procedure falls into a two step process: ### Tensor-level masking A tensor-level masking is a mask generation routine that has a granularity of `sparse_block_shape`. That means that only patches of that shape can be considered sparse/dense. To vectorize: 1. Reshape the data such that one of the dimensions represents the patches of sparse_block_shape. 2. Create a mask of the same shape as the reshaped data 3. Find the smallest `k` elements in the the data, given the dimension of the sparse "patches". `k` represents a derived paramter specifying the sparsity level. 4. Apply the 0/1 to the patches in the mask 5. Reshape the mask back to the original dimensions Note: because the shape of the mask might not be multiple of the sparse_block_shape, we nudge the sshape of the mask, and truncate it afterwards. ## Block-level masking A block-level masking is a mask generation routine that concerns itself only with sparsity within a patch of shape `sparse_block_shape`. This is useful when block sparsity allows partial block sparsification. To vectorize: Overall the block-level masking follows the same routine as the tensor-level algorithm described above. One distinction is that when reshaping the data/mask tensors we aim for creating a dimension that captures the internals of each patch. For example, if a `sparse_block_shape` is `(2, 2)`, we want to reshape the data/mask into `(2, 2, -1)`. That allows us to sort the internal elements on the last axis, and zero-out the ones that obey the sparse logic. Differential Revision: [D37352494](https://our.internmc.facebook.com/intern/diff/D37352494/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37352494/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/80059 Approved by: https://github.com/jerryzh168	2022-07-12 19:16:44 +00:00
Norman Ponte	2e8b9c7785	[TorchArrow][AIBench] Add AIBench Metrics for TorchArrow Inference Benchmark Test (#75035 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/75035 - modify `--ai_pep_format` to `--report_aibench` to better reflect underlying framework name change Reviewed By: tgalkovskyi Differential Revision: D35257017 fbshipit-source-id: 6c0a2e4585db928b029484d4b81165bfc99bff9f (cherry picked from commit 18f4962539ccb09a3c33b146206342ea3930f275)	2022-04-01 00:35:42 +00:00
Sergii Dymchenko	5b011fc6eb	Fix Undefined variable in QInterpolateBenchmark Pull Request resolved: https://github.com/pytorch/pytorch/pull/73130 Approved by: https://github.com/malfet	2022-03-09 00:14:15 +00:00
Sergii Dymchenko	486572223b	Fix command example (#72847 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72847 Reviewed By: malfet Differential Revision: D34260868 Pulled By: kit1980 fbshipit-source-id: 1b225f3c2c7a822e44df4bbd91766e6533eab6d7 (cherry picked from commit `c9e874c4d8`)	2022-02-16 21:45:45 +00:00
Ben Koopman	c2c859bdf2	[quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66560 Test Plan: Imported from OSS Reviewed By: HDCharles Differential Revision: D31618282 Pulled By: b-koopman fbshipit-source-id: ebfe723cfc4004f413f157e65532d64e8d0274b3	2021-11-19 06:29:19 -08:00
Michael Suo	5c3529a86d	[lint] small pass to make lint clean (#68367 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68367 - bmm_test.py was using syntax not allowed in 3.6 - Some suppressions were not placed on the correct line. With this file, ``` lintrunner --paths-cmd='git grep -Il .' ``` passes successfully. Test Plan: Imported from OSS Reviewed By: janeyx99, mrshenli Differential Revision: D32436644 Pulled By: suo fbshipit-source-id: ae9300c6593d8564fb326822de157d00f4aaa3c2	2021-11-16 10:27:00 -08:00
Shashank Chaudhry	89c4e8c22b	[NOOP][clangformat][codemod] Enable CLANGFORMAT for some folders in caffe2/* (#67746 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67746 Test Plan: Visual inspection. Sandcastle. Reviewed By: zertosh Differential Revision: D31986646 fbshipit-source-id: 91885c20c3cead3853c49abb9fe0a94a67f33cc8	2021-11-03 12:23:14 -07:00
Bin Wen	6900aacf54	[fbcode] Fix operator_benchmark with jit mode (#67382 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67382 two simple updates: * fix running benchmark with --use_jit. Previously will fail with error torch.jit.frontend.UnsupportedNodeError: import statements aren't supported: File "/proc/self/fd/3/bmm_test.py", line 9 def __invoke_main(): import ctypes ~~~~~~ <--- HERE import ctypes.util import errno * add matmul to bmm benchmark as D31837588 Test Plan: buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:bmm_test -- --forward_only=True --mkl_num_threads=1 --omp_num_threads=1 --use_jit=True Reviewed By: ShijunK Differential Revision: D31960528 fbshipit-source-id: 84b892934149784d1b8a0f90b0233cc2f1cf1f5f	2021-10-28 08:48:10 -07:00
Vasiliy Kuznetsov	d802877dfa	speed up quantized interpolate for channels last (#66525 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66525 This should solve https://github.com/pytorch/pytorch/issues/60015 There were two `q_zero_point()` accesses inside a for loop which was expensive. Moving them to before the loop sped things up 10x for a microbenchmark. Test Plan: ``` // comment out benchmarks unrelated to original issue, for simplicity cd benchmarks/operator_benchmark python -m pt.qinterpolate_test // before: 2994 us // after: 324 us // full results: https://gist.github.com/vkuzo/cc5ef9526dc0cda170d6d63498c16453 ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D31592422 fbshipit-source-id: b6078ac1039573bbe545275f7aedfd580910b459	2021-10-14 08:11:26 -07:00
Alexandr Guzhva	b8e1999253	[quant] Add op benchmark for GPU FakeQuantizePerChannel with float zero_points (#66183 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66183 Add a GPU benchmark for fakeQuant, similar to #65241 ghstack-source-id: 139810414 Test Plan: https://pxl.cl/1QjJM Reviewed By: b-koopman Differential Revision: D31288158 fbshipit-source-id: 65526248b5c7b70f0bc32a86b08f50b4cbc7a83d	2021-10-06 08:07:42 -07:00
Vasiliy Kuznetsov	e3af4be963	pytorch quantization ao migration phase 2: caffe2/benchmark (#65833 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65833 Renames `torch.quantization` to `torch.ao.quantization` in `caffe2/benchmarks` folder. ``` find caffe2/benchmarks/ -type f -name "*.py" -print0 \| xargs -0 sed -i "s/torch\.quantization/torch.ao.quantization/g" ``` Test Plan: CI Reviewed By: z-a-f Differential Revision: D31275963 fbshipit-source-id: 8596bf28df5c3ad2c4490ac8abb285d6517c0116	2021-10-01 06:17:36 -07:00
Philip Meier	aebde1bc2b	deprecate device getter from `torch.testing` namespace (#63844 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63844 Test Plan: Imported from OSS Reviewed By: H-Huang Differential Revision: D31141433 Pulled By: mruberry fbshipit-source-id: a29331278ab99a19e225e2cb357458e3db4f9732	2021-09-29 02:40:52 -07:00
Ben Koopman	6a6ee92e36	[quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65241 Test Plan: Imported from OSS Reviewed By: jingsh Differential Revision: D31150087 Pulled By: b-koopman fbshipit-source-id: a00d4995841eee81305d0007c908473cc3d5a727	2021-09-27 16:01:49 -07:00
Eddie Ren	9c73a48ecf	ND Embeddings benchmark - Standardize randomized inputs (#64707 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64707 Use torch.randn instead of torch.from_numpy to generate the tensor Test Plan: buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test Reviewed By: jingsh Differential Revision: D30817302 fbshipit-source-id: 924c05517812b4b9f7df05a8999f9236cfe7b672	2021-09-13 06:47:35 -07:00
Eddie Ren	3fbb49e75d	Extend 2Dim embedding bag benchmarking to include 3Dim benchmarks (#64647 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64647 Add support for benchmarking of 8 bit quantizations of N-D batched embeddings. Currently only works for 3Dim embeddings and still requires thought on ramping up from 3Dim to NDim. Test Plan: ```buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_pack_test``` Reviewed By: jingsh Differential Revision: D30770085 fbshipit-source-id: 26659020f3458991592065a05366bde0f060494e	2021-09-10 16:49:02 -07:00
Harut Movsisyan	956c8fa01e	Microbenchmarking matrix mult (einsum, torch.mult, torch.mm) (#63654 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63654 Test Plan: ``` > buck run mode/opt caffe2/benchmarks/operator_benchmark/pt:matrix_mult_test # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 27.970 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 41.830 # Benchmarking PyTorch: einsum_bmm # Mode: Eager # Name: einsum_bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 499.114 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B4_M5_N3_K2_cpu # Input: B: 4, M: 5, N: 3, K: 2, device: cpu Forward Execution Time (us) : 6.268 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B32_M25_N20_K30_cpu # Input: B: 32, M: 25, N: 20, K: 30, device: cpu Forward Execution Time (us) : 12.676 # Benchmarking PyTorch: bmm # Mode: Eager # Name: bmm_B128_M100_N120_K110_cpu # Input: B: 128, M: 100, N: 120, K: 110, device: cpu Forward Execution Time (us) : 438.219 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 7.657 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 18.523 # Benchmarking PyTorch: einsum_elementwise # Mode: Eager # Name: einsum_elementwise_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 55.103 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B4_M5_N3_cpu # Input: B: 4, M: 5, N: 3, device: cpu Forward Execution Time (us) : 2.501 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B32_M25_N20_cpu # Input: B: 32, M: 25, N: 20, device: cpu Forward Execution Time (us) : 10.589 # Benchmarking PyTorch: mul # Mode: Eager # Name: mul_B100_M90_N110_cpu # Input: B: 100, M: 90, N: 110, device: cpu Forward Execution Time (us) : 50.102 Reviewed By: ajyu Differential Revision: D30455179 fbshipit-source-id: 9f2d92b2d2b860f41a8e59be2cc086d75b587f7b	2021-08-24 16:26:26 -07:00
Supriya Rao	7a15576a65	[quant] update FakeQuant modules to use tensor qparams (#61318 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61318 Remove the `float()` and `int()` calls in the forward function so that we can directly use the tensor qparams in the fake_quantize operator. Calling `float()/int()` internally calls `item()` which can trigger a gpu-> cpu copy if the original tensors reside on GPU. Local benchmark P427668213 Before this change ``` Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::_aminmax 2.57% 1.507ms 3.10% 1.819ms 36.371us 2.872ms 4.81% 2.872ms 57.446us 50 aten::fake_quantize_per_tensor_affine 1.04% 610.915us 3.60% 2.114ms 42.276us 472.896us 0.79% 2.698ms 53.962us 50 aten::fake_quantize_per_tensor_affine_cachemask 1.69% 993.626us 2.56% 1.503ms 30.058us 2.225ms 3.73% 2.225ms 44.504us 50 aten::is_nonzero 3.85% 2.258ms 19.68% 11.540ms 46.161us 2.168ms 3.63% 11.084ms 44.336us 250 aten::zeros_like 1.82% 1.064ms 6.65% 3.901ms 39.007us 1.531ms 2.57% 3.905ms 39.045us 100 aten::eq 13.80% 8.093ms 25.90% 15.189ms 37.972us 9.580ms 16.05% 15.566ms 38.914us 400 aten::item 5.67% 3.323ms 21.50% 12.607ms 36.019us 3.233ms 5.42% 12.167ms 34.762us 350 aten::zeros 0.94% 549.208us 2.93% 1.717ms 34.343us 688.928us 1.15% 1.695ms 33.894us 50 aten::le 2.52% 1.478ms 4.50% 2.641ms 26.411us 1.753ms 2.94% 2.845ms 28.448us 100 aten::rsub 1.04% 608.715us 2.44% 1.433ms 28.667us 532.000us 0.89% 1.418ms 28.353us 50 aten::max 1.54% 905.401us 4.62% 2.711ms 27.106us 847.488us 1.42% 2.697ms 26.969us 100 aten::ones 0.92% 542.159us 2.16% 1.266ms 25.324us 661.856us 1.11% 1.301ms 26.017us 50 aten::min 0.82% 479.167us 2.15% 1.258ms 25.160us 407.808us 0.68% 1.276ms 25.530us 50 aten::_local_scalar_dense 15.83% 9.284ms 15.83% 9.284ms 26.526us 8.934ms 14.97% 8.934ms 25.524us 350 aten::clamp 2.35% 1.378ms 4.21% 2.467ms 24.669us 1.546ms 2.59% 2.461ms 24.612us 100 aten::zero_ 2.53% 1.482ms 5.65% 3.316ms 22.108us 1.326ms 2.22% 3.380ms 22.531us 150 aten::maximum 3.08% 1.805ms 3.08% 1.805ms 18.052us 1.849ms 3.10% 1.849ms 18.494us 100 aten::minimum 1.33% 778.854us 1.33% 778.854us 15.577us 868.672us 1.46% 868.672us 17.373us 50 aten::round 1.36% 799.910us 1.36% 799.910us 15.998us 809.568us 1.36% 809.568us 16.191us 50 aten::copy_ 6.61% 3.878ms 6.61% 3.878ms 15.513us 4.036ms 6.76% 4.036ms 16.143us 250 aten::div 2.53% 1.483ms 2.53% 1.483ms 14.833us 1.535ms 2.57% 1.535ms 15.353us 100 aten::mul 2.44% 1.431ms 2.44% 1.431ms 14.314us 1.478ms 2.48% 1.478ms 14.782us 100 aten::detach 1.46% 855.670us 2.41% 1.411ms 14.110us 832.448us 1.39% 1.395ms 13.949us 100 aten::add 2.22% 1.301ms 2.22% 1.301ms 13.008us 1.383ms 2.32% 1.383ms 13.828us 100 aten::fill_ 4.18% 2.452ms 4.18% 2.452ms 12.262us 2.693ms 4.51% 2.693ms 13.463us 200 aten::sub 5.06% 2.967ms 5.06% 2.967ms 14.837us 2.675ms 4.48% 2.675ms 13.374us 200 aten::to 2.10% 1.230ms 3.65% 2.140ms 10.701us 1.310ms 2.20% 2.062ms 10.310us 200 aten::select 1.28% 749.144us 1.49% 874.227us 8.742us 863.232us 1.45% 863.232us 8.632us 100 detach 0.95% 555.326us 0.95% 555.326us 5.553us 562.496us 0.94% 562.496us 5.625us 100 aten::as_strided 0.40% 232.289us 0.40% 232.289us 1.161us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 2.93% 1.720ms 2.93% 1.720ms 3.439us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.04% 611.313us 1.04% 611.313us 2.038us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.75% 438.585us 1.77% 1.036ms 5.180us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.36% 799.442us 1.36% 799.442us 3.198us 0.000us 0.00% 0.000us 0.000us 250 --------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 58.645ms Self CUDA time total: 59.674ms ``` After this change ``` test_fake_quant_profiler (scripts.supriyar.benchmark.module_bench.ProfilerBench) ... ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ aten::fake_quantize_per_tensor_affine 0.98% 505.210us 4.38% 2.259ms 45.187us 419.424us 0.78% 3.218ms 64.367us 50 aten::_aminmax 2.78% 1.434ms 3.42% 1.766ms 35.321us 2.825ms 5.27% 2.825ms 56.505us 50 aten::fake_quantize_per_tensor_affine_cachemask_tens... 2.38% 1.229ms 3.40% 1.754ms 35.083us 2.799ms 5.22% 2.799ms 55.979us 50 aten::rsub 0.94% 485.040us 5.02% 2.590ms 51.793us 458.976us 0.86% 2.587ms 51.747us 50 aten::is_nonzero 3.78% 1.952ms 23.64% 12.196ms 48.786us 2.055ms 3.83% 11.986ms 47.944us 250 aten::item 6.92% 3.572ms 19.86% 10.244ms 40.977us 3.670ms 6.85% 9.931ms 39.724us 250 aten::zeros_like 1.65% 848.874us 6.64% 3.426ms 34.260us 1.397ms 2.61% 3.572ms 35.717us 100 aten::zeros 0.85% 436.691us 3.00% 1.549ms 30.984us 551.936us 1.03% 1.576ms 31.516us 50 aten::eq 10.60% 5.467ms 20.26% 10.452ms 26.130us 7.018ms 13.09% 10.832ms 27.079us 400 aten::le 2.58% 1.332ms 4.67% 2.407ms 24.074us 1.580ms 2.95% 2.614ms 26.144us 100 aten::_local_scalar_dense 12.93% 6.673ms 12.93% 6.673ms 26.691us 6.261ms 11.68% 6.261ms 25.046us 250 aten::clamp 2.43% 1.253ms 4.37% 2.256ms 22.560us 1.431ms 2.67% 2.273ms 22.725us 100 aten::ones 0.89% 460.133us 2.18% 1.123ms 22.467us 570.496us 1.06% 1.128ms 22.551us 50 aten::min 0.74% 383.132us 2.06% 1.065ms 21.296us 377.536us 0.70% 1.091ms 21.824us 50 aten::zero_ 2.36% 1.219ms 5.87% 3.029ms 20.194us 1.261ms 2.35% 3.199ms 21.327us 150 aten::max 1.51% 779.081us 4.06% 2.096ms 20.960us 791.680us 1.48% 2.130ms 21.295us 100 aten::sub 7.97% 4.111ms 7.97% 4.111ms 20.556us 3.847ms 7.18% 3.847ms 19.234us 200 aten::div 2.94% 1.516ms 2.94% 1.516ms 15.158us 1.580ms 2.95% 1.580ms 15.798us 100 aten::round 1.45% 750.445us 1.45% 750.445us 15.009us 756.064us 1.41% 756.064us 15.121us 50 aten::copy_ 6.88% 3.548ms 6.88% 3.548ms 14.190us 3.701ms 6.90% 3.701ms 14.803us 250 aten::minimum 1.32% 681.654us 1.32% 681.654us 13.633us 713.664us 1.33% 713.664us 14.273us 50 aten::maximum 2.55% 1.317ms 2.55% 1.317ms 13.169us 1.338ms 2.50% 1.338ms 13.378us 100 aten::mul 2.63% 1.358ms 2.63% 1.358ms 13.581us 1.328ms 2.48% 1.328ms 13.283us 100 aten::detach 1.34% 688.820us 2.35% 1.211ms 12.110us 772.800us 1.44% 1.278ms 12.779us 100 aten::fill_ 4.53% 2.338ms 4.53% 2.338ms 11.692us 2.495ms 4.65% 2.495ms 12.473us 200 aten::add 2.32% 1.197ms 2.32% 1.197ms 11.968us 1.240ms 2.31% 1.240ms 12.405us 100 aten::to 2.07% 1.069ms 3.66% 1.889ms 9.443us 1.224ms 2.28% 1.975ms 9.874us 200 aten::select 1.44% 743.042us 1.64% 848.207us 8.482us 641.600us 1.20% 641.600us 6.416us 100 detach 1.01% 522.155us 1.01% 522.155us 5.222us 505.088us 0.94% 505.088us 5.051us 100 aten::as_strided 0.44% 227.884us 0.44% 227.884us 1.139us 0.000us 0.00% 0.000us 0.000us 200 aten::empty 3.20% 1.652ms 3.20% 1.652ms 3.304us 0.000us 0.00% 0.000us 0.000us 500 aten::resize_ 1.25% 646.711us 1.25% 646.711us 2.156us 0.000us 0.00% 0.000us 0.000us 300 aten::empty_like 0.79% 407.768us 2.07% 1.067ms 5.334us 0.000us 0.00% 0.000us 0.000us 200 aten::empty_strided 1.52% 785.788us 1.52% 785.788us 3.143us 0.000us 0.00% 0.000us 0.000us 250 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 51.590ms Self CUDA time total: 53.609ms ghstack-source-id: 133370215 Test Plan: buck test mode/dev-nosan caffe2/test/:quantization Reviewed By: raghuramank100 Differential Revision: D29566512 fbshipit-source-id: 1aefca51f99949da7334bcfe504848275c9f952c	2021-07-10 19:43:02 -07:00
Kimish Patel	3176f16691	[Pytorch benchmark] Add BMM benchmark (#59595 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595 ghstack-source-id: 130946743 Test Plan: bmm_test Reviewed By: mingzhe09088 Differential Revision: D28873228 fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c	2021-06-10 08:24:29 -07:00
Kimish Patel	8b63573c31	[PyTorch Operator Benchmark] gelu benchmark (#59334 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334 Add gelu op benchmark ghstack-source-id: 130947172 Test Plan: gelu_test Reviewed By: hl475 Differential Revision: D28842959 fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4	2021-06-09 16:09:37 -07:00
Rong Rong (AI Infra)	277f587496	rename benchmark_cpp_extension (#58708 ) Summary: Currently the cpp_extension build in benchmarks is misleading as it has the same name with torch.utils.cpp_extension Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708 Test Plan: Run from `./benchmarks/operator_benchmark/pt_extension` folder: ``` python setup.py install python cpp_extension_test.py ``` Note: CI doesn't matter as currently benchmarks/ folder is not compiled/test against CI Reviewed By: robieta Differential Revision: D28585582 Pulled By: walterddr fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57	2021-05-24 11:04:02 -07:00

1 2 3 4 5 ...

327 Commits