Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42810
In this diff, the original backward pass implementation is sped up by merging the three separate iterations that computed dX, dScale, and dZeroPoint. A native loop now walks the data directly at the byte level (via `strides`). In addition, the computation is vectorized: scale and zero point are expanded to the same shape as X so that their values correspond element-wise to X along the channel axis.
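For reference, a simplified Python sketch of the fused backward described above (illustration only, not the C++ kernel; the name `fused_backward_reference` is made up for this example, and it omits the grad-factor scaling the learnable operator applies). Scale and zero point are broadcast along the channel axis so all three gradients come out of one pass over X:
```
import torch

def fused_backward_reference(dY, X, scale, zero_point, axis, quant_min, quant_max):
    # Broadcast per-channel scale/zero_point to the shape of X along `axis`.
    shape = [1] * X.dim()
    shape[axis] = X.size(axis)
    s = scale.reshape(shape)
    zp = zero_point.reshape(shape)

    Xq = torch.round(X / s + zp)                    # pre-clamp quantized value
    inside = (Xq >= quant_min) & (Xq <= quant_max)  # inside the quantized range
    Xq_clamped = torch.clamp(Xq, quant_min, quant_max)
    Xfq = (Xq_clamped - zp) * s                     # fake-quantized output

    zeros = torch.zeros_like(dY)
    dX = torch.where(inside, dY, zeros)
    dScale = torch.where(inside, dY * (Xfq - X) / s, dY * (Xq_clamped - zp))
    dZeroPoint = torch.where(inside, zeros, -dY * s)

    # Reduce the per-element scale/zero_point gradients over all non-channel dims.
    reduce_dims = [d for d in range(X.dim()) if d != axis]
    return dX, dScale.sum(reduce_dims), dZeroPoint.sum(reduce_dims)
```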
In the operator benchmark, for an input of shape `3x3x256x256`, we observed the following speedups:
**Speedup from python operator**: ~10x
**Speedup from original learnable kernel**: ~5.4x
**Speedup from non-backprop kernel**: ~1.8x
Test Plan:
To verify correctness of the new kernel, on a devvm, run the command
`buck test //caffe2/test:quantization -- learnable_backward_per_channel`
To benchmark the operators, on a devvm:
1. Set the input size to 3x3x256x256 or another reasonable shape.
2. Run `buck test //caffe2/benchmarks/operator_benchmark/pt:quantization_test`
3. The relevant outputs for CPU are as follows:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 989024.686
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 95654.079
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cpu_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 176948.970
```
4. The relevant outputs for GPU are as follows:
**Pre-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6795.173
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 4321.351
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1052.066
```
**Post-optimization**:
```
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typepy_module
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: py_module
Backward Execution Time (us) : 6737.106
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typelearnable_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: learnable_kernel
Backward Execution Time (us) : 2112.484
# Benchmarking PyTorch: FakeQuantizePerChannelOpBenchmark
# Mode: Eager
# Name: FakeQuantizePerChannelOpBenchmark_N3_C3_H256_W256_cuda_op_typeoriginal_kernel
# Input: N: 3, C: 3, H: 256, W: 256, device: cpu, op_type: original_kernel
Backward Execution Time (us) : 1078.79
```
Reviewed By: vkuzo
Differential Revision: D22946853
fbshipit-source-id: 1a01284641480282b3f57907cc7908d68c68decd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42756
Similar to ELU, CELU was also broken in the quantized benchmark; this fixes it.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23010863
fbshipit-source-id: 203e63f9cff760af6809f6f345b0d222dc1e9e1b
Summary:
Run the fastrnns benchmark using the pytest-benchmark infra, then parse its JSON output and upload it to Scribe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42030
Reviewed By: malfet
Differential Revision: D22970270
Pulled By: wconstab
fbshipit-source-id: 87da9b7ddf741da14b80d20779771d19123be3c5
Summary:
According to pytorch/rfcs#3
From the goals in the RFC:
1. Support subclassing `torch.Tensor` in Python (done here)
2. Preserve `torch.Tensor` subclasses when calling `torch` functions on them (done here)
3. Use the PyTorch API with `torch.Tensor`-like objects that are _not_ `torch.Tensor`
subclasses (done in https://github.com/pytorch/pytorch/issues/30730)
4. Preserve `torch.Tensor` subclasses when calling `torch.Tensor` methods. (done here)
5. Propagate subclass instances correctly with operators as well, using
views/slices/indexing/etc. (done here)
6. Preserve subclass attributes when using methods or views/slices/indexing. (done here)
7. A way to insert code that operates on both functions and methods uniformly
(so we can write a single function that overrides all operators). (done here)
8. The ability to give external libraries a way to also define
functions/methods that follow the `__torch_function__` protocol. (will be addressed in a separate PR)
This PR makes the following changes:
1. Adds the `self` argument to the arg parser.
2. Dispatches on `self` as well if `self` is not `nullptr`.
3. Adds a `torch._C.DisableTorchFunction` context manager to disable `__torch_function__`.
4. Adds a `torch::torch_function_enabled()` and `torch._C._torch_function_enabled()` to check the state of `__torch_function__`.
5. Dispatches all `torch._C.TensorBase` and `torch.Tensor` methods via `__torch_function__`.
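A minimal sketch of the behavior this enables (the class name `MyTensor` is illustrative, not from the diff): torch functions and Tensor methods dispatch through `__torch_function__`, results keep the subclass type, and `torch._C.DisableTorchFunction` turns the dispatch off.
```
import torch

class MyTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        # Delegate to the default implementation, which re-wraps outputs as cls.
        return super().__torch_function__(func, types, args, kwargs)

x = MyTensor(torch.randn(3))
assert isinstance(torch.sin(x), MyTensor)   # torch functions preserve the subclass
assert isinstance(x[1:], MyTensor)          # views/indexing preserve it too

with torch._C.DisableTorchFunction():       # context manager added in this PR
    y = torch.sin(x)                        # runs without __torch_function__ dispatch
```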
TODO:
- [x] Sequence Methods
- [x] Docs
- [x] Tests
Closes https://github.com/pytorch/pytorch/issues/28361
Benchmarks in https://github.com/pytorch/pytorch/pull/37091#issuecomment-633657778
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37091
Reviewed By: ngimel
Differential Revision: D22765678
Pulled By: ezyang
fbshipit-source-id: 53f8aa17ddb8b1108c0997f6a7aa13cb5be73de0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42491
Hooks up quantized batchnorm_1d to the quantized_bn kernel. Eager mode
hookup will be in a future PR, and graph mode should work after this PR.
Note: the implementation is currently ~2x slower on the benchmark than q_batch_norm2d
because we convert back to contiguous memory format at the end, since
channels_last is only defined for rank >= 4. If further optimization is
needed, that can be a separate PR (we will need the NHWC folks to check whether
there is a workaround). Meanwhile, having this is better than having nothing.
Context: There have been both internal and external requests for various
quantized BN1d use cases.
Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d
python test/test_quantization.py TestQuantizedOps.test_batch_norm_1d_2d_3d_relu
python test/test_quantization.py TestQuantizeJitOps.test_qbatch_norm
// performance:
// https://gist.github.com/vkuzo/73a07c0f24c05f5804990d9ebfaecf5e
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22926254
fbshipit-source-id: 2780e6a81cd13a7455f6ab6e5118c22850a97a12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42318
We forgot to update this benchmark when quantized elu's signature
changed to require observation; this fixes it.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qactivation_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D22845251
fbshipit-source-id: 1443f6f0deac695715b1f2bd47f0f22b96dc72ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41974
In this diff, 2 new sets of benchmark tests are added to the `quantization` benchmark suite, providing operator-level benchmarking of the learnable Python operators, the learnable C++ kernels, and the original non-backprop C++ kernels.
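As a rough standalone illustration of the operator being measured (the real tests live in the operator_benchmark suite; this is only a hedged sketch timing the non-backprop kernel, and the zero_point dtype expected by the op has varied across PyTorch versions):
```
import time
import torch

x = torch.rand(3, 3, 256, 256)
scale = torch.ones(3)
zero_point = torch.zeros(3, dtype=torch.int32)  # int32/int64 depending on version

def run():
    return torch.fake_quantize_per_channel_affine(
        x, scale, zero_point, axis=1, quant_min=0, quant_max=255)

for _ in range(10):   # warmup
    run()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    run()
print((time.perf_counter() - start) / iters * 1e6, "us per forward")
```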
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (On devGPU with 0% volatile utilization -- all GPUs are free):
Each sample has dimensions **3x256x256**;
### In **microseconds** (`1e-6` second),
| | Python Module | C++ Kernel | Non-backprop C++ Kernel |
|---------------------------|---------------|------------|-------------------------|
| Per Tensor CPU Forward | 3112.666 | 3270.740 | 3596.864 |
| Per Tensor Cuda Forward | 797.258 | 258.961 | 133.953 |
| Per Channel CPU Forward | 6587.693 | 6931.461 | 6352.417 |
| Per Channel Cuda Forward | 1579.576 | 555.723 | 479.016 |
| Per Tensor CPU Backward | 72278.390 | 22466.648 | 12922.195 |
| Per Tensor Cuda Backward | 6512.280 | 1546.218 | 652.942 |
| Per Channel CPU Backward | 74138.545 | 41212.777 | 14131.576 |
| Per Channel Cuda Backward | 6795.173 | 4321.351 | 1052.066 |
Reviewed By: z-a-f
Differential Revision: D22715683
fbshipit-source-id: 8be528b790663413cbeeabd4f68bbca00be052dd
Summary:
Move the timing utils to `torch.utils._benchmark`. I couldn't figure out how to get setuptools to pick it up and put it under `torch` unless it is in the `torch` directory. (And I think it has to be for `setup.py develop` anyway.)
I also modified the record function benchmark since `Timer` and `Compare` should always be available now.
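A brief usage sketch of the Timer being moved (illustration only; at the time of this change the module lived under `torch.utils._benchmark`, and it was later exposed as `torch.utils.benchmark`):
```
from torch.utils.benchmark import Timer  # was torch.utils._benchmark at the time

t = Timer(
    stmt="x.mul(y)",
    setup="import torch; x = torch.rand(64, 64); y = torch.rand(64, 64)",
)
print(t.timeit(100))   # prints a Measurement with per-run statistics
```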
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41506
Reviewed By: ngimel
Differential Revision: D22601460
Pulled By: robieta
fbshipit-source-id: 9cea7ff1dcb0bb6922c15b99dd64833d9631c37b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41429
This diff contains the benchmark test to evaluate the speed of executing the learnable fake quantization operator, both in the forward path and the backward path, with respect to both per tensor and per channel usages.
Test Plan:
Inside the path `torch/benchmarks/operator_benchmark` (The root directory will be `caffe2` inside `fbcode` if working on a devvm):
- On a devvm, run the command `buck run pt:fake_quantize_learnable_test`
- On a personal laptop, run the command `python3 -m pt.fake_quantize_learnable_test`
Benchmark Results (Locally on CPU):
Each sample has dimensions **3x256x256**; each batch has 16 samples (`N=16`)
- Per Tensor Forward: 0.023688 sec/sample
- Per Tensor Backward: 0.165926 sec/sample
- Per Channel Forward: 0.040432 sec/sample
- Per Channel Backward: 0.173528 sec/sample
Reviewed By: vkuzo
Differential Revision: D22535252
fbshipit-source-id: e8e953ff2de2107c6f2dde4c8d5627bdea67ef7f
Summary:
Related to https://github.com/pytorch/pytorch/issues/41368
These benchmarks already support CUDA, so there is no reason for them not to be in the benchmark config.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41438
Reviewed By: zhangguanheng66
Differential Revision: D22540756
Pulled By: ezyang
fbshipit-source-id: 621eceff37377c1ab06ff7483b39fc00dc34bd46
Summary: The device attribute in the op benchmark can only be 'cpu' or 'cuda', so this diff adds a check.
Test Plan: buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --warmup_iterations 1 --iterations 1
Reviewed By: ngimel
Differential Revision: D22538252
fbshipit-source-id: 3e5af72221fc056b8d867321ad22e35a2557b8c3
Summary: Change the device config in the qobserver test to a string to honor the --device flag.
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qobserver_test -- --iterations 1 --device cpu
Reviewed By: ngimel
Differential Revision: D22536379
fbshipit-source-id: 8926b2393be1f52f9183f8205959a3ff18e3ed2a
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of that is documentation and examples. I've trimmed a bit on the edges since we last discussed this design (for instance Timer is no longer Fuzzer aware), but it's mostly the same.
In addition to the library and hermetic examples, I've included `examples.end_to_end` which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar) I only did CPU as I'm not set up on a GPU machine yet. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)
Key takeaways:
1) For contiguous Tensors, larger dtypes (fp32 and fp64) and lots of reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
2) There is an extra ~1.5 us overhead, which dominates small kernels.
3) Cases with lower write intensity (int8, lower mask fraction, etc.) or non-contiguous layouts seem to suffer.
Hopefully this demonstrates the proof-of-concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338
Differential Revision: D21551048
Pulled By: robieta
fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db
Summary:
Fixes https://github.com/pytorch/pytorch/issues/38716, fixes https://github.com/pytorch/pytorch/issues/37234
This algorithm does the summation along a single axis with multiple "levels" of accumulators, each designed to hold the sum of an order of magnitude more values than the previous.
e.g. if there are 2^16 elements, the first level will hold the sum of 2^4 elements, and so on in increasing powers of 2: 2^4, 2^8, 2^12 and finally 2^16.
This limits the differences in magnitude of the partial results being added together, and so we don't lose accuracy as the axis length increases.
WIP to write a vectorized version.
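A simplified two-level Python sketch of the cascaded accumulation described above (illustration only; the actual kernel uses more levels and is written in C++):
```
def cascade_sum(values, chunk=16):
    level0 = 0.0   # accumulates at most `chunk` consecutive values
    level1 = 0.0   # accumulates the chunk sums
    for i, v in enumerate(values, 1):
        level0 += v
        if i % chunk == 0:
            level1 += level0   # partial sums stay similar in magnitude
            level0 = 0.0
    return level1 + level0

print(cascade_sum(range(1, 101)))  # 5050.0
```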
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39516
Reviewed By: ezyang
Differential Revision: D22106251
Pulled By: ngimel
fbshipit-source-id: b56de4773292439dbda62b91f44ff37715850ae9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962
Adds a simple ref-counted wrapper for CUDA events and
destroys the CUDA event after the last copy is destroyed.
Test Plan: CI cuda profiler tests
Differential Revision: D22027092
Pulled By: ilia-cher
fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360
Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
Test Plan:
```
python -m pt.qobserver_test
```
Imported from OSS
Differential Revision: D21828985
fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Otherwise, I don't understand how those could have been invoked.
Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832
Differential Revision: D21675081
Pulled By: malfet
fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
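A scalar Python sketch of the remainder semantics involved here (illustration only; the vectorized Vec256 kernel expresses the branch below with a blendv select instead of an `if`):
```
import math

def remainder_ref(a, b):
    # remainder takes the sign of the divisor; fmod takes the sign of the dividend
    r = math.fmod(a, b)
    if r != 0.0 and (r < 0.0) != (b < 0.0):
        r += b
    return r

assert remainder_ref(-7.0, 3.0) == 2.0   # matches torch.remainder semantics
assert math.fmod(-7.0, 3.0) == -1.0      # fmod keeps the dividend's sign
```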
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291
Move profiler state to be a thread-local property and
reuse the existing thread-local propagation mechanism to ensure
correct profiling of async tasks. This also makes the
push/pop callbacks thread safe and easier to use in e.g. the
distributed profiler.
Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py
Differential Revision: D20938501
Pulled By: ilia-cher
fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847
Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.
Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21107925
fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835
Adds a quantized groupnorm operator. We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.
Benchmark results: the quantized layer is between 6x and 15x faster
than the floating point version, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```
Numerics are nearly equivalent, with the only difference documented
in the test case. The difference is of the same type as with quantized
layernorm. Making numerics exactly equivalent is possible but would
sacrifice speed.
Imported from OSS
Differential Revision: D21107926
fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980
Missed this in the original diff; fixing. Create the output tensor directly instead of quantizing it.
Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)
Imported from OSS
Differential Revision: D21185970
fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674
Slight changes to the qlinear benchmark so that it is in the same format
as the linear benchmark, for fairer comparisons between FP and quantized.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Differential Revision: D21102562
fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can better compare across the two.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198
The need for this tool was motivated by #28883. In the past, we have
done ad-hoc benchmarking, but it's time for something more structured.
It would be nice to add more model architectures so that we can get a
full picture of the performance impact of a code change simply by
running this suite a few times.
Test Plan: Imported from OSS
Differential Revision: D20591296
Pulled By: mrshenli
fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731
Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing that
the measured time did not change).
Test Plan:
Run the benchmark; the time now changes as expected with input size for
these ops.
Imported from OSS
Differential Revision: D20875542
fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729
* adds benchmarks for a few quantized activations that had implementations but no benchmarks
* adds the input sizes from `unary_tests.py` here, so we can compare the fp and quantized implementations of the activations fairly
Test Plan:
```
python -m pt.qactivation_test
```
Imported from OSS
Differential Revision: D20875544
fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
Summary:
Since the last one was apparently reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35530
Differential Revision: D20777341
Pulled By: ezyang
fbshipit-source-id: 6aaaf2a0755359074ae3d0efe32018d78dafe976
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the requirement for the Python-exposed `_jit_register_tensorexpr_fuser`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34820
Adds quantized version of hardswish, for common quantized operator coverage.
Note:
* we carry over scale and zero_point from the input to the output, because the
range of the output is unbounded if x > 0
* we also skip the .out function to not allow the user to specify a custom
scale+zp (flexible on this).
Test Plan:
```
python test/test_quantized.py
https://gist.github.com/vkuzo/f9b579315ed7f5fdb24839e3218d8465
```
Imported from OSS
Differential Revision: D20472905
fbshipit-source-id: 0f2a83e9f5f7b43485fa46caf30e756dc5d492a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747
Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used. A future PR will
add the quantized version. CUDA is saved for a future PR as well.
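As a quick reference for the definition being added (MobileNetV3's hardswish(x) = x * relu6(x + 3) / 6); the comparison against `torch.nn.functional.hardswish` assumes that functional entry point is available:
```
import torch
import torch.nn.functional as F

def hardswish_ref(x):
    # MobileNetV3 activation: x * relu6(x + 3) / 6
    return x * F.relu6(x + 3.0) / 6.0

x = torch.randn(8)
assert torch.allclose(hardswish_ref(x), F.hardswish(x))
```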
Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```
microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)
Imported from OSS
Differential Revision: D20451404
fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34959
Adds quantized implementation of hardsigmoid.
The original PR was https://github.com/pytorch/pytorch/pull/34607 and had to
be reverted due to a test breakage; trying again.
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20514212
fbshipit-source-id: cc7ae3b67757e2dde5c313c05ce60a0f2625d961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34607
Adds quantized version of hardsigmoid activation.
Note: not implementing the `_` and `.out` variants is
currently intentional, because the implementation changes the scale and
zero point, and it is nice not to allow the user to specify them. Let me
know if we should handle this differently.
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20480546
fbshipit-source-id: 9febcb44afd920125ed2ca4900492f0b712078ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33719
We were seeing a strange error where gathering profiler events (specifically `parse_cpu_trace` in `profiler.py`) would fail with the error:
`IndexError: pop from empty list`.
It turned out that this was because for one particular `Event`, there was a pop recorded but not a push. Instead of the `push` event being completely missing, it was overwritten by a completely different event.
After a bunch of debugging, and trying several hypotheses, it turns out that this was a race condition in `RangeEventList::record`. What happened was that different threads would call into `RangeEventList::record` on the same event list instance, and one record would stomp over the data written by the other one. Somehow the data written was a valid `Event` so the error did not manifest itself until the profiler realized a `pop` was missing a matching `push` in the python code.
I fixed this by adding a lock to serialize writes to `RangeEventList::record`.
This PR also makes a small change to pass in the `RecordFunction` name into `popRange`. It makes the debugging easier when investigating the events recorded.
Differential Revision: D20071125
fbshipit-source-id: 70b51a65bcb833a7c88b7462a978fd3a39265f7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545
This is for common operator coverage, since this is widely used. A future PR
will add the quantized version.
Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a backwards.out method for this?
* do we need CUDA? If yes, should it be in this PR, or is it ok to split it out?
Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32
// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315
Forward Execution Time (us) : 42.603
```
Imported from OSS
Differential Revision: D20371692
fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34230
This PR adds some benchmarks that we used to assess tensor expression performance.
Differential Revision: D20251830
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: bafd66ce32f63077e3733112d854f5c750d5b1af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34267
Adds quantized ELU.
Test Plan:
```
python test/test_quantized.py TestQuantizedOps.test_qelu
```
still need to benchmark, saving that for after the review comments
Imported from OSS
Differential Revision: D20370953
fbshipit-source-id: fe941bf966f72dd9eee2c4b2ef45fe7afb50c866
Summary:
For long format strings, it is better to give the fields names.
When creating a dict, a literal is more readable and faster than the dict constructor.
I always appreciate your efforts in creating the world's best frameworks.
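A quick illustration of the dict point above: the literal avoids a global name lookup and a function call, so it is typically a bit faster (timings are machine-dependent).
```
import timeit

literal_time = timeit.timeit("{'a': 1, 'b': 2}", number=1_000_000)
ctor_time = timeit.timeit("dict(a=1, b=2)", number=1_000_000)
print(f"literal: {literal_time:.3f}s  dict(): {ctor_time:.3f}s")
```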
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31352
Differential Revision: D19191967
Pulled By: ngimel
fbshipit-source-id: 21f063b163b67de8cf9761a4db5991f74318e991
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31334
The wipe cache logic was introduced in the hope of reducing variation in the benchmark results. Based on our experiment results, it didn't actually help with that. In addition, several engineers had encountered the issue of a missing cpuinfo.h, which was used in the wipe cache logic. So this diff removes that feature to ensure smooth installation and running of the op bench.
Test Plan:
```
buck run caffe2/benchmarks/operator_benchmark/pt:add_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M1_N1_K1_cpu
# Input: M: 1, N: 1, K: 1, device: cpu
Forward Execution Time (us) : 111.192
```
The A/B test also passes: Benchmark Run #2476535015
Reviewed By: hl475
Differential Revision: D19126970
fbshipit-source-id: 9b1ab48c121838836ba6e0ae664a48fe2d18efdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30040
The benchmark will run each test in a loop of 200 iters, then keep doubling the number of iters until the time is significant. For operators with very large input shapes, the initial 200 iters take more time than is really necessary, so this diff changes that 200 to 100.
(Note: this ignores all push blocking failures!)
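A hedged sketch of the auto-ranging behavior described above (illustration only, not the operator_benchmark implementation; the helper name `time_per_iter` is made up): start at a small iteration count and keep doubling until the measured time is significant.
```
import time

def time_per_iter(op, start_iters=100, min_time_s=0.2):
    iters = start_iters
    while True:
        t0 = time.perf_counter()
        for _ in range(iters):
            op()
        elapsed = time.perf_counter() - t0
        if elapsed >= min_time_s:
            return elapsed / iters   # time per iteration once the run is long enough
        iters *= 2                   # double and retry
```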
Test Plan:
```
Before
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : None
# Benchmarking PyTorch: ConvTranspose2d
# Mode: Eager
# Name: ConvTranspose2d_in_c512_out_c512_kernel3_stride2_N8_H64_W64_cpu
# Input: in_c: 512, out_c: 512, kernel: 3, stride: 2, N: 8, H: 64, W: 64, device: cpu
Forward Execution Time (us) : 729634.577
After
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : None
# Benchmarking PyTorch: ConvTranspose2d
# Mode: Eager
# Name: ConvTranspose2d_in_c512_out_c512_kernel3_stride2_N8_H64_W64_cpu
# Input: in_c: 512, out_c: 512, kernel: 3, stride: 2, N: 8, H: 64, W: 64, device: cpu
Forward Execution Time (us) : 718315.899
```
Reviewed By: hl475
Differential Revision: D18579588
fbshipit-source-id: ef52474cf77e7549bbab0a9ae7b1b0c04023d208
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29865
For some operators, the number of tests (forward + backward) can easily go above 100. Many of them could be redundant, so this diff reduces the number of shapes.
Test Plan:
```
buck run //caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M64_N64_K64_cpu
# Input: M: 64, N: 64, K: 64, device: cpu
Forward Execution Time (us) : 28418.926
...
```
Reviewed By: hl475
Differential Revision: D18520946
fbshipit-source-id: 1056d6d5a9c46bc2d508ff133039aefeb9d11c27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29864
This diff makes `all` a reserved keyword for tag_filter. When `all` is passed by the user, all supported shapes will be run.
Test Plan:
```
buck run //caffe2/benchmarks/operator_benchmark/pt:add_test -- --iterations 1 --tag_filter all
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8_N32_K256_cpu
# Input: M: 8, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 6798.688
...
```
Reviewed By: hl475
Differential Revision: D18520249
fbshipit-source-id: 4d55af9f46f89b2fe8842e1a00dfa8e5acaf4fa2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29830
as title
Test Plan: na
Reviewed By: hl475
Differential Revision: D18506023
fbshipit-source-id: 15693894c0aa736ab3e818bc740099f0d629cb84