Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694
For ops with parameters (e.g. conv), running in JIT mode currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward with JIT mode, we probably need to turn `TorchBenchmarkBase` into a subclass of `nn.Module`.
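For context, a minimal standalone sketch of why turning gradients off helps (the tensor names and shapes are illustrative, not the benchmark harness itself): tracing a function that closes over a tensor with `requires_grad=True` triggers exactly this error, while disabling gradients first lets the tracer bake the weight in as a constant.
```
import torch
import torch.nn.functional as F

weight = torch.nn.Parameter(torch.randn(16, 3, 3, 3))   # conv weight, requires_grad=True
x = torch.randn(1, 3, 32, 32)

def forward(inp):
    return F.conv2d(inp, weight)

# traced = torch.jit.trace(forward, (x,))    # raises the RuntimeError quoted above
weight.requires_grad_(False)                 # turn gradients off for the forward run
traced = torch.jit.trace(forward, (x,))      # tracing now succeeds
```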
Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24451206
fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
Summary: benchmark_caffe2 is broken due to a refactoring that changed from eager test generation to registration only.
Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748
# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570
# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```
Reviewed By: qizzzh
Differential Revision: D24448374
fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
The current way of importing configs hits a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.
Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```
The run was successful in aibench:
https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246
Reviewed By: radkris-git
Differential Revision: D24268413
fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
sparse is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853
The method name in the README is not consistent with the actual implementation.
Reviewed By: qizzzh
Differential Revision: D24114849
fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and `0 + 0j` for `x == 0`.
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.
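A quick illustration of the definition above (the values in the comments are the mathematically expected results):
```
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
print(torch.sgn(z))    # 0.6+0.8j, 0+0j, -1+0j
print(z / z.abs())     # matches sgn for nonzero entries; the zero entry becomes nan+nanj
```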
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956
In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23093996
fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018
This diff fixes an issue where the original non-learnable fake quantize was given a trainable scale and zero point; `requires_grad` should be completely disabled for both parameters.
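As a minimal sketch of the intended behavior (illustrative only, not the benchmark code): with `requires_grad` off, gradients flow through the fake-quantize op to the input only, and scale/zero point stay fixed.
```
import torch

# Non-learnable path: scale and zero_point must not be trainable.
scale = torch.nn.Parameter(torch.tensor([0.1]), requires_grad=False)
zero_point = torch.nn.Parameter(torch.tensor([0.0]), requires_grad=False)

x = torch.randn(4, 8, requires_grad=True)
y = torch.fake_quantize_per_tensor_affine(x, scale.item(), int(zero_point.item()), 0, 255)
y.sum().backward()          # populates x.grad; scale and zero_point receive no gradient
```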
Test Plan:
Use the following command to execute the benchmark test:
`buck test mode/dev-nosan pt:quantization_test`
Reviewed By: vkuzo
Differential Revision: D23107846
fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767
Same as previous PR, forcing the qlinear benchmark to follow the fp one
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23013937
fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810
In this diff, the original backward-pass implementation is sped up by merging the 3 separate iterations that compute dX, dScale, and dZeroPoint into one. A native loop is used directly at the byte level (via `strides`), and the computation is vectorized by expanding scale and zero point so they share the same shape as X and correspond element-wise to X along the channel axis. A conceptual sketch of the fused computation is included after the speedup numbers below.
In the benchmark test on the operators, for an input of shape `3x3x256x256`, we have observed the following improvement in performance:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x
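The sketch below illustrates the fused computation in plain PyTorch. It assumes LSQ-style gradient formulas for the learnable fake-quantize op and omits the real kernel's byte-level strided loop and any gradient scaling factor, so treat it as a conceptual outline rather than the shipped C++ implementation.
```
import torch

def fused_fake_quant_per_channel_backward(dY, X, scale, zero_point, axis, qmin, qmax):
    # Broadcast the per-channel scale/zero_point against X along `axis`.
    shape = [1] * X.dim()
    shape[axis] = X.size(axis)
    s = scale.reshape(shape)
    z = zero_point.reshape(shape).to(X.dtype)

    Xq = torch.round(X / s + z)                      # pre-clamp quantized values
    in_range = (Xq >= qmin) & (Xq <= qmax)
    Xc = Xq.clamp(qmin, qmax)

    # All three gradients come out of one pass over the data.
    dX = dY * in_range.to(dY.dtype)                  # straight-through estimator
    dScale_elem = dY * (Xc - z - torch.where(in_range, X / s, torch.zeros_like(X)))
    dZeroPoint_elem = torch.where(in_range, torch.zeros_like(dY), -dY * s)

    reduce_dims = [d for d in range(X.dim()) if d != axis]
    return dX, dScale_elem.sum(reduce_dims), dZeroPoint_elem.sum(reduce_dims)
```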
Test Plan:
To assert correctness of the new kernel, on a devvm, enter the command
`buck test //caffe2/test:quantization -- learnable_backward_per_channel`
To benchmark the operators, on a devvm:
1. Set the input size to 3x3x256x256 or another reasonable shape.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:
**Pre-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```
**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```
Reviewed By: vkuzo
Differential Revision: D22946853
fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756
Similar to ELU, CELU was also broken in the quantized benchmark, fixing.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23010863
fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491
Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.
Note: currently the implementation is ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (will need the NHWC folks to see if
there is a workaround). Meanwhile, having this is better than not having anything.
Context: There have been both internal and external requests for various
quantized BN1d use cases.
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm
// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22926254
fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318
We forgot to update this benchmark when quantized elu's signature
changed to require observation, fixing.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22845251
fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974
In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite where operator-level benchmarking is conducted for the learnable Python operators, the learnable c++ kernels, and the original non-backprop c++ kernels.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;
### In **microseconds** (`1e-6` second),
| | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward | 3112.666 | 3270.740 | 3596.864 |
| Per Tensor Cuda Forward | 797.258 | 258.961 | 133.953 |
| Per Channel CPU Forward | 6587.693 | 6931.461 | 6352.417 |
| Per Channel Cuda Forward | 1579.576 | 555.723 | 479.016 |
| Per Tensor CPU Backward | 72278.390 | 22466.648 | 12922.195 |
| Per Tensor Cuda Backward | 6512.280 | 1546.218 | 652.942 |
| Per Channel CPU Backward | 74138.545 | 41212.777 | 14131.576 |
| Per Channel Cuda Backward | 6795.173 | 4321.351 | 1052.066 |
Reviewed By: z-a-f
Differential Revision: D22715683
fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429
This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; Each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample
Reviewed By: vkuzo
Differential Revision: D22535252
fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368
These benchmarks already support CUDA, so there is no reason for them not to be in the benchmark config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438
Reviewed By: zhangguanheng66
Differential Revision: D22540756
Pulled By: ezyang
fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
Summary: The device attribute in the op benchmark can only include 'cpu' or 'cuda', so this diff adds a check for that.
Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1
Reviewed By: ngimel
Differential Revision: D22538252
fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
Summary: Change the device config in the qobserver test to a string so it honors the --device flag.
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test -- --iterations 1 --device cpu
Reviewed By: ngimel
Differential Revision: D22536379
fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234
This algorithm does the summation along a single axis with multiple "levels" of accumulators, each designed to hold the sum of an order of magnitude more values than the previous one.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
WIP to write a vectorized version.
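Below is a small NumPy illustration of the leveled-accumulator idea (the actual change is a C++ kernel and is structured differently; the `block=16` default and the comparison against a float64 reference are just for demonstration):
```
import numpy as np

def naive_sum(values):
    acc = np.float32(0.0)                            # one running accumulator
    for v in values:
        acc += v
    return acc

def cascade_sum(values, block=16):
    level = np.asarray(values, dtype=np.float32)
    while level.size > 1:
        pad = (-level.size) % block
        level = np.pad(level, (0, pad))              # zero-pad to whole blocks
        level = level.reshape(-1, block).sum(axis=1) # one "level" of partial sums
    return level[0]

x = np.full(2 ** 16, 0.1, dtype=np.float32)
ref = x.astype(np.float64).sum()
print(abs(naive_sum(x) - ref))                       # sequential: error grows with length
print(abs(cascade_sum(x) - ref))                     # cascaded: stays close to the reference
```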
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516
Reviewed By: ezyang
Differential Revision: D22106251
Pulled By: ngimel
fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360
Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
Test Plan:
```
python -m pt.qobserver_test
```
Imported from OSS
Differential Revision: D21828985
fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Otherwise, I don't understand how those could have been invoked
Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832
Differential Revision: D21675081
Pulled By: malfet
fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
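For reference, a user-level illustration of the semantics the kernel has to preserve (this is not the `Vec256` code): `remainder` follows the divisor's sign, `fmod` the dividend's, and `remainder` can be recovered from `fmod` with a branch-free fixup, which is what the `blendv` expresses in the vectorized path.
```
import torch

a = torch.tensor([-3.0, 3.0, -3.0], dtype=torch.half)
b = torch.tensor([ 2.0, -2.0, -2.0], dtype=torch.half)

print(torch.fmod(a, b))        # -1.,  1., -1.  (sign of the dividend)
print(torch.remainder(a, b))   #  1., -1., -1.  (sign of the divisor)

r = torch.fmod(a, b)
fixup = (r != 0) & ((r < 0) != (b < 0))        # signs differ and fmod is nonzero
print(torch.where(fixup, r + b, r))            # matches torch.remainder(a, b)
```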
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847
Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.
Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)
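A float-side sanity check of the relationship being reused (not the quantized kernel itself): instance norm is group norm with one group per channel, which is why most of the groupnorm logic carries over.
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 6, 8, 8)
w, b = torch.randn(6), torch.randn(6)

out_in = F.instance_norm(x, weight=w, bias=b, eps=1e-5)
out_gn = F.group_norm(x, num_groups=6, weight=w, bias=b, eps=1e-5)
print(torch.allclose(out_in, out_gn, atol=1e-5))   # expect True
```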
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21107925
fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835
Adds a quantized groupnorm operator. We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.
Benchmark results: the quantized layer is between 6x and 15x faster than the fp one, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```
Numerics are nearly equivalent, with the only difference documented
in the test case. The difference is the same type as with quantized
layernorm. Making numerics equivalent is possible but will sacrifice
speed.
Imported from OSS
Differential Revision: D21107926
fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980
Missed this in the original diff, fixing: create the output tensor directly instead of quantizing it.
Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)
Imported from OSS
Differential Revision: D21185970
fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674
Slight changes to qlinear benchmark to have it be in the same format
as linear, for fairer comparisons between FP and Q.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Differential Revision: D21102562
fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare across the two better.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731
Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing
the measured time not change).
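A minimal reproduction of the pitfall, outside the benchmark harness (timings are illustrative only): constructing `nn.ReLU()` does no tensor work, so its "runtime" is flat regardless of input size, while `F.relu(x)` actually scales with the input.
```
import timeit
import torch
import torch.nn.functional as F

x_small = torch.randn(64)
x_large = torch.randn(1 << 20)

print(timeit.timeit(lambda: torch.nn.ReLU(), number=1000))   # module construction only
print(timeit.timeit(lambda: F.relu(x_small), number=1000))   # tiny input
print(timeit.timeit(lambda: F.relu(x_large), number=1000))   # 1M elements, clearly slower
```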
Test Plan:
Run the benchmark; the time now changes as expected with input size for these ops.
Imported from OSS
Differential Revision: D20875542
fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729
* there were a few quantized activations that had implementations but no benchmarks; this adds them
* adds the input sizes from `unary_tests.py` here, so we can fairly compare fp and quantized implementations of activations
Test Plan:
```
python -m pt.qactivation_test
```
Imported from OSS
Differential Revision: D20875544
fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d