Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962
Adds a simple ref-counted wrapper for the CUDA event and destroys the
CUDA event after the last copy is destroyed.
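A minimal Python sketch of the idea (hypothetical class names; the actual change lives in the C++ profiler code): copies share one event through an explicit reference count, and the event is only dropped when the last copy is released.
```
import torch

class _SharedCudaEvent:
    """Shared state: the CUDA event plus an explicit reference count."""
    def __init__(self):
        self.event = torch.cuda.Event(enable_timing=True)
        self.refcount = 0

class CudaEventHandle:
    """Hypothetical copyable handle; the underlying event is destroyed
    only when the last handle is released."""
    def __init__(self, shared=None):
        self._shared = shared or _SharedCudaEvent()
        self._shared.refcount += 1

    def copy(self):
        return CudaEventHandle(self._shared)

    def release(self):
        self._shared.refcount -= 1
        if self._shared.refcount == 0:
            # Last copy gone: drop the CUDA event.
            self._shared.event = None
```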
Test Plan: CI cuda profiler tests
Differential Revision: D22027092
Pulled By: ilia-cher
fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360
Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
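A rough sketch of the kind of call being benchmarked (observer choice and sizes are illustrative, not taken from the benchmark config):
```
import torch
from torch.quantization import MinMaxObserver

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# The benchmark times the observer's forward pass on the chosen device.
obs = MinMaxObserver(dtype=torch.quint8).to(device)
x = torch.randn(32, 64, device=device)
obs(x)                                    # record min/max statistics
scale, zero_point = obs.calculate_qparams()
print(scale.item(), zero_point.item())
```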
Test Plan:
```
python -m pt.qobserver_test
```
Imported from OSS
Differential Revision: D21828985
fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Otherwise, I don't understand how those could have been invoked.
Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832
Differential Revision: D21675081
Pulled By: malfet
fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` whereas `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` and enabled half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could enable `BFloat16` for `remainder` as well if that's desirable; either way, I don't think `Vec256<BFloat16>` should be missing `fmod`.
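For reference, a small illustration of the semantic difference the kernels implement (the `blendv`-based select replaces the per-element sign check); half is used here since the change enables half dispatch for `fmod`:
```
import torch

a = torch.tensor([-3.0, 3.0], dtype=torch.half)
b = torch.tensor([2.0, -2.0], dtype=torch.half)

# fmod keeps the sign of the dividend; remainder keeps the sign of the divisor.
print(torch.fmod(a, b))       # tensor([-1.,  1.], dtype=torch.float16)
print(torch.remainder(a, b))  # tensor([ 1., -1.], dtype=torch.float16)
```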
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291
Moves the profiler state to a thread-local property and reuses the existing
thread-local propagation mechanism to ensure correct profiling of async
tasks. This also makes the push/pop callbacks thread safe and easier to use
in e.g. the distributed profiler.
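A hedged sketch of what this enables (assuming `torch.jit.fork` from a recent release): work forked onto another thread still shows up under the parent profiling context, because the profiler state rides along with the existing thread-local propagation.
```
import torch

@torch.jit.script
def work(x):
    return x * 2

def main(x):
    fut = torch.jit.fork(work, x)   # runs asynchronously on another thread
    return torch.jit.wait(fut)

x = torch.randn(4, 4)
with torch.autograd.profiler.profile() as prof:
    main(x)
# Events from the forked task are attributed to this profiling context.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```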
Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py
Differential Revision: D20938501
Pulled By: ilia-cher
fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847
Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.
Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0).
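The reuse works because instance norm is just group norm with one group per channel; a quick floating-point sanity check of that relationship (the quantized op itself is defined in this PR, so it is not called here):
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)
weight, bias = torch.randn(4), torch.randn(4)

# Instance norm normalizes each (N, C) plane, i.e. group norm with C groups.
out_in = F.instance_norm(x, weight=weight, bias=bias, eps=1e-5)
out_gn = F.group_norm(x, num_groups=4, weight=weight, bias=bias, eps=1e-5)
print(torch.allclose(out_in, out_gn, atol=1e-5))  # True
```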
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21107925
fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835
Adds a quantized groupnorm operator. We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.
Benchmark results: the quantized layer is between 6x and 15x faster than
the floating-point version, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9).
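A rough numeric check of the relationship being exploited: without the affine step, group norm over a single group reduces to layer norm over (C, H, W), and the per-channel weight/bias is exactly the channel-wise scaling the kernel was modified to apply.
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)

# Without affine params, one-group group norm == layer norm over (C, H, W).
out_gn = F.group_norm(x, num_groups=1)
out_ln = F.layer_norm(x, normalized_shape=x.shape[1:])
print(torch.allclose(out_gn, out_ln, atol=1e-5))  # True

# Group norm's affine is per channel (shape [C]), broadcast over H and W --
# the channel-wise scaling added on top of the layernorm logic.
w, b = torch.randn(4), torch.randn(4)
out_affine = F.group_norm(x, num_groups=1, weight=w, bias=b)
expected = out_gn * w.view(1, 4, 1, 1) + b.view(1, 4, 1, 1)
print(torch.allclose(out_affine, expected, atol=1e-5))  # True
```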
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```
Numerics are nearly equivalent, with the only difference documented in the
test case; it is the same kind of difference seen with quantized layernorm.
Making the numerics exactly equivalent is possible but would sacrifice
speed.
Imported from OSS
Differential Revision: D21107926
fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980
Missed this in the original diff, fixing: create the output tensor directly instead of quantizing it.
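A hedged sketch of the pattern (the actual kernel is C++; `torch._empty_affine_quantized` is a private allocator used here only for illustration): instead of computing a float result and then quantizing it, allocate the quantized output up front and let the int8 kernel write into it.
```
import torch

scale, zero_point = 0.1, 0

# Slower pattern: compute in float, then quantize (an extra pass over the data).
fp_out = torch.randn(64, 64)
q_slow = torch.quantize_per_tensor(fp_out, scale, zero_point, torch.quint8)

# Faster pattern: create the quantized output tensor directly.
q_fast = torch._empty_affine_quantized(
    (64, 64), scale=scale, zero_point=zero_point, dtype=torch.quint8)
print(q_fast.q_scale(), q_fast.q_zero_point())
```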
Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)
Imported from OSS
Differential Revision: D21185970
fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674
Slight changes to the qlinear benchmark so it uses the same format as the
linear benchmark, for fairer comparisons between FP and quantized.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Differential Revision: D21102562
fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare the two more easily.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198
The need for this tool was motivated by #28883. In the past, we have
done ad-hoc benchmarking, but it's time for something more structured.
It would be nice to add more model architectures so that we can get a
full picture of the performance impact of a code change simply by
running this suite a few times.
Test Plan: Imported from OSS
Differential Revision: D20591296
Pulled By: mrshenli
fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731
Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing that
the measured time did not change).
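A minimal sketch of the distinction (benchmark harness details omitted): timing `nn.ReLU()` alone measures module construction, whereas the fix times the activation itself, so the result scales with input size.
```
import timeit
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024)

# What the old benchmark effectively measured: constructing the module.
t_construct = timeit.timeit(lambda: torch.nn.ReLU(), number=1000)

# What the fixed benchmark measures: actually running the activation.
t_run = timeit.timeit(lambda: F.relu(x), number=1000)
print(f"construct: {t_construct:.4f}s  run: {t_run:.4f}s")
```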
Test Plan:
Run the benchmark; the time now changes as expected with input size for
these activations.
Imported from OSS
Differential Revision: D20875542
fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729
* adds benchmarks for a few quantized activations that had implementations but no benchmarks
* adds the input sizes from `unary_tests.py` here, so we can compare the fp and quantized implementations of activations fairly
Test Plan:
```
python -m pt.qactivation_test
```
Imported from OSS
Differential Revision: D20875544
fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
Summary:
Since the last one was apparently reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35530
Differential Revision: D20777341
Pulled By: ezyang
fbshipit-source-id: 6aaaf2a0755359074ae3d0efe32018d78dafe976
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the need to call the Python-exposed `_jit_register_tensorexpr_fuser`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34820
Adds quantized version of hardswish, for common quantized operator coverage.
Note:
* we carry over scale and zero_point from the input to the output, because the
range of the output is unbounded for x > 0 (see the sketch after these notes)
* we also skip the .out function so the user cannot specify a custom
scale + zero_point (flexible on this).
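A hedged sketch of the scale/zero_point choice (a dequantize/compute/requantize reference, not the new kernel; assumes `torch.nn.functional.hardswish` from a recent release): because hardswish is unbounded for x > 0, the output simply reuses the input's quantization parameters.
```
import torch
import torch.nn.functional as F

x = torch.randn(16)
scale, zero_point = 0.05, 128
qx = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

# Reference path: dequantize, apply hardswish, requantize with the *input*
# scale/zero_point, since the output range is unbounded for x > 0.
ref = F.hardswish(qx.dequantize())
q_ref = torch.quantize_per_tensor(ref, scale, zero_point, torch.quint8)
print(q_ref.q_scale(), q_ref.q_zero_point())  # same qparams as the input
```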
Test Plan:
```
python test/test_quantized.py
https://gist.github.com/vkuzo/f9b579315ed7f5fdb24839e3218d8465
```
Imported from OSS
Differential Revision: D20472905
fbshipit-source-id: 0f2a83e9f5f7b43485fa46caf30e756dc5d492a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747
Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used. A future PR will
add the quantized version. CUDA is saved for a future PR as well.
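For reference, a small sketch of the function being added (the MobileNetV3 definition written with `relu6`; not the actual kernel, and the comparison to `F.hardswish` assumes a release that already ships it):
```
import torch
import torch.nn.functional as F

def hardswish_ref(x):
    # hardswish(x) = x * relu6(x + 3) / 6, per the MobileNetV3 paper.
    return x * F.relu6(x + 3.0) / 6.0

x = torch.randn(8)
print(torch.allclose(hardswish_ref(x), F.hardswish(x)))  # True
```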
Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```
microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)
Imported from OSS
Differential Revision: D20451404
fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34959
Adds quantized implementation of hardsigmoid.
Original PR was https://github.com/pytorch/pytorch/pull/34607 and had to
be reverted for a test breakage, trying again.
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20514212
fbshipit-source-id: cc7ae3b67757e2dde5c313c05ce60a0f2625d961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34607
Adds quantized version of hardsigmoid activation.
Note: not implementing the `_` and `.out` variants is currently intentional,
because the implementation changes the scale and zero_point, and it's nice
not to let the user specify them. Let me know if we should handle this
differently.
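A hedged sketch of why the output qparams differ from the input's: hardsigmoid lands in [0, 1], so the implementation can pick qparams covering exactly that range (the scale=1/256, zero_point=0 below is an assumption for illustration, not necessarily what the kernel uses):
```
import torch
import torch.nn.functional as F

x = torch.randn(16)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)

# hardsigmoid(x) = relu6(x + 3) / 6 lands in [0, 1], so the output can use
# qparams covering [0, 1] (assumed here: scale=1/256, zero_point=0) instead
# of the input's scale/zero_point.
ref = F.hardsigmoid(qx.dequantize())
q_ref = torch.quantize_per_tensor(ref, 1.0 / 256.0, 0, torch.quint8)
print(q_ref.q_scale(), q_ref.q_zero_point())
```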
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20480546
fbshipit-source-id: 9febcb44afd920125ed2ca4900492f0b712078ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33719
We were seeing a strange error where gathering profiler events (specifically `parse_cpu_trace` in `profiler.py`) would fail with the error:
`IndexError: pop from empty list`.
It turned out that this was because for one particular `Event`, there was a pop recorded but not a push. Instead of the `push` event being completely missing, it was overwritten by a completely different event.
After a bunch of debugging and trying several hypotheses, it turned out that this was a race condition in `RangeEventList::record`: different threads would call into `RangeEventList::record` on the same event list instance, and one record would stomp over the data written by the other. Somehow the data written was a valid `Event`, so the error did not manifest itself until the profiler realized a `pop` was missing a matching `push` in the Python code.
I fixed this by adding a lock to serialize writes to `RangeEventList::record`.
This PR also makes a small change to pass in the `RecordFunction` name into `popRange`. It makes the debugging easier when investigating the events recorded.
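A Python analogue of the fix (the real change adds a mutex inside the C++ `RangeEventList::record`): serialize concurrent writers so one thread's event cannot stomp over another's.
```
import threading

class EventList:
    """Toy stand-in for RangeEventList."""

    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def record(self, kind, name):
        # Serialize writers, mirroring the lock added to the C++ record().
        with self._lock:
            self._events.append((kind, name))

events = EventList()
threads = [threading.Thread(target=events.record, args=("push", f"op{i}"))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(events._events))  # 8
```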
Differential Revision: D20071125
fbshipit-source-id: 70b51a65bcb833a7c88b7462a978fd3a39265f7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545
This is for common operator coverage, since this is widely used. A future PR
will add the quantized version.
Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a `backwards.out` method for this?
* do we need CUDA? If yes, should it be in this PR or is it ok to split it into a follow-up?
Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32
// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315
Forward Execution Time (us) : 42.603
```
Imported from OSS
Differential Revision: D20371692
fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34230
This PR adds some benchmarks that we used to assess tensor expression performance.
Differential Revision: D20251830
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: bafd66ce32f63077e3733112d854f5c750d5b1af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34267
Adds quantized ELU.
Test Plan:
```
python test/test_quantized.py TestQuantizedOps.test_qelu
```
Still need to benchmark; saving that for after the review comments.
Imported from OSS
Differential Revision: D20370953
fbshipit-source-id: fe941bf966f72dd9eee2c4b2ef45fe7afb50c866