Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694
For ops with parameters (e.g. conv), running in JIT mode currently raises
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, we decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work with backward in JIT mode, we probably need to make `TorchBenchmarkBase` a subclass of `nn.Module`.
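A minimal sketch of the failure and the workaround, assuming a hypothetical stand-alone repro (the actual change lives in the benchmark framework's JIT path):
```python
import torch
import torch.nn.functional as F

weight = torch.nn.Parameter(torch.randn(16, 3, 3, 3))

def conv_forward(x):
    # `weight` is a free variable, so tracing bakes it into the graph as a
    # constant; if it still requires grad, this raises the RuntimeError above.
    return F.conv2d(x, weight)

# Workaround: the benchmark only measures the forward pass, so gradients for
# the parameter can be turned off before compiling.
weight.requires_grad_(False)
traced = torch.jit.trace(conv_forward, torch.randn(1, 3, 32, 32))
```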
Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24451206
fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
Summary: benchmark_caffe2 is broken due to a refactoring that changed test generation from eager to registration-only.
Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748
# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570
# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```
Reviewed By: qizzzh
Differential Revision: D24448374
fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679
The current way of importing configs causes a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). This diff fixes that issue.
ghstack-source-id: 114857978
Test Plan: waitforsandcastle
Reviewed By: vkuzo
Differential Revision: D24459631
fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219
- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);

  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs;  // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor as well as in the static runtime, and is shared across threads. This is what's left inside the StaticRuntime:
```
mutable std::vector<IValue> reg_;
// The nodes we need to run
std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations and differs across threads at runtime. `nodes_` holds the op nodes and input/output registers and is, for now, the same across threads. We could potentially put other stateful data structures into it, so I kept it inside the static runtime. It could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.
- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.
- Integration with PyTorchPredictor. Added a lockfree stack in the PyTorchPredictor to hold all the static runtime instances. Benchmark shows that the `push` and `pop` combo takes about 80 ns, which is quite acceptable.
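A rough Python sketch of the pooling pattern described above (illustrative only: the real implementation is a lock-free C++ stack inside PyTorchPredictor, while Python's `LifoQueue` uses a lock):
```python
import queue

class RuntimePool:
    """Keep one runtime instance per concurrent request; all instances share
    the same immutable InferenceModule, only their registers differ."""

    def __init__(self, make_runtime, size):
        self._stack = queue.LifoQueue()
        for _ in range(size):
            self._stack.put(make_runtime())

    def run(self, *args):
        rt = self._stack.get()      # pop a runtime (thread-local reg_ state)
        try:
            return rt(*args)
        finally:
            self._stack.put(rt)     # push it back for the next request
```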
This diff focuses on the threading model only. Benchmarks will be separate.
Reviewed By: bwasti
Differential Revision: D24237078
fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308
This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overhead of the static runtime. While the static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely zero-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime at batch size 1.
Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 4096K (x24)
L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825  << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089  << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184  << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```
PS: The implementation could probably be optimized even more.
Differential Revision: D24300702
Test Plan: Imported from OSS
Reviewed By: dzhulgakov
Pulled By: ZolotukhinM
fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
Summary: Add an operator benchmark for 4-bit/8-bit embedding lookups in `aibench`.
Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```
The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474 and https://www.internalfb.com/intern/aibench/details/346463246
Reviewed By: radkris-git
Differential Revision: D24268413
fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124
We want to make sure we can actually fuse kernels within a fairly
tight time budget. So here's a quick benchmark of codegen for a simple
pointwise activation function (swish). I kept all the intermediate tensors
separate to force TE to actually do inlining.
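For reference, the computation being compiled looks roughly like the following Python sketch of swish with explicit intermediates (the benchmark itself builds the equivalent tensorexpr expressions in C++):
```python
import torch

def swish_with_intermediates(x):
    # Each step is a separate temporary, so the codegen has several
    # intermediates it must inline rather than one pre-fused expression.
    neg = -x
    e = torch.exp(neg)
    denom = e + 1.0
    sig = 1.0 / denom      # sigmoid(x)
    return x * sig         # swish(x) = x * sigmoid(x)
```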
Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```
I've only run this in debug mode, so the results aren't super meaningful, but even in
that mode compilation takes 18 ms, 15 ms of which are in LLVM.
Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
----------------------------------------------------------------------------
BM_CompileSwish             5123276 ns      5119846 ns          148
BM_CompileSwishLLVMOnly     4754361 ns      4753701 ns          160
```
Reviewed By: asuhan
Differential Revision: D24232801
fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up on https://github.com/pytorch/pytorch/pull/44101 and further supports characterizing fusers with dynamic-shape benchmarks. The dynamic-shape condition models the use case where the input tensor shapes change on each call to the graph.
Changes include:
* Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found with `DynamicSimpleElementBench` (see the sketch after this list).
* Created new bench_cls: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamically shaped versions of existing benchmarks and examples of enabling dynamic shapes with `DynamicShape`.
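As a rough illustration of the pattern (not the actual `DynamicShape` API), a dynamically shaped element-wise benchmark draws a fresh input shape before every call:
```python
import random
import torch

class DynamicSimpleElementSketch:
    """Illustrative only: regenerate the input shapes on every call so each
    iteration exercises the fuser with a different tensor size."""

    def __init__(self, max_dim=1 << 12):
        self.max_dim = max_dim

    def instantiate_input(self):
        m = random.randint(1, self.max_dim)
        n = random.randint(1, self.max_dim)
        return torch.rand(m, n), torch.rand(m, n)

    def forward(self):
        a, b = self.instantiate_input()   # new shape on each call
        return a + b                      # simple element-wise op
```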
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107
Reviewed By: glaringlee
Differential Revision: D24229400
Pulled By: bertmaher
fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003
`sparse` is confusing because it is used in training for sparse gradients
Test Plan: Imported from OSS
Reviewed By: radkris-git, qizzzh
Differential Revision: D24178248
fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853
The method name in the README is not consistent with the actual implementation.
Reviewed By: qizzzh
Differential Revision: D24114849
fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875
Adds a google benchmark harness for perf-testing programs generated by
tensorexpr, without any PyTorch wrappers (for Python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).
Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm call to give a baseline).
Right now there's just an unoptimized implementation that is expected to be
fairly slow. More optimized versions are coming.
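The baseline measurement is conceptually equivalent to this Python sketch (the actual harness is C++ on google benchmark; the numbers below come from that harness, not from this sketch):
```python
import time
import torch

M = N = K = 128
a = torch.randn(M, K)
b = torch.randn(K, N)

iters = 1000
start = time.perf_counter()
for _ in range(iters):
    torch.mm(a, b)
elapsed = (time.perf_counter() - start) / iters

# A gemm performs 2*M*N*K floating-point operations (multiply + add).
gflops = 2 * M * N * K / elapsed / 1e9
print(f"torch.mm baseline: {elapsed * 1e9:.0f} ns/iter, {gflops:.2f} GFLOP/s")
```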
Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
L1 Data 32K (x24)
L1 Instruction 32K (x24)
L2 Unified 256K (x24)
L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                              Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128              73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128  3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D24142403
Pulled By: bertmaher
fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955
resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`.
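For example:
```python
import torch

z = torch.tensor([3 + 4j, 0j, -2 + 0j])
s = torch.sgn(z)
# 3+4j has magnitude 5, so its sgn is 0.6+0.8j; 0 maps to 0+0j; -2 maps to -1+0j.
```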
This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide the autograd behavior (JAX vs TF) and add gradcheck.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D23460526
Pulled By: anjali411
fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.
Besides adding an option to select the NVIDIA fuser, this support includes a few other modifications:
* Adds FP16 Datatype support
* Fixes the SOL/algorithmic calculations to use the actual data-type size instead of a fixed 4 bytes (see the sketch after this list)
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs currently only for reductions
* Adds further reduction support for Inner and Outer dimension reductions that are compatible with the `input_iter` knob.
* Adds `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.
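A minimal sketch of the byte accounting mentioned above, assuming hypothetical helper names:
```python
import torch

def algorithmic_bytes(tensors):
    # Hypothetical helper: count memory traffic using the actual element size
    # (2 bytes for FP16, 4 for FP32, ...) instead of assuming 4 bytes.
    return sum(t.numel() * t.element_size() for t in tensors)

def achieved_bandwidth_gb_s(tensors, seconds):
    return algorithmic_bytes(tensors) / seconds / 1e9
```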
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101
Reviewed By: ngimel
Differential Revision: D23713658
Pulled By: bertmaher
fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270
The previous PR (#44212) was reverted because I didn't update the
`upload_scribe.py` script: it was still looking for the 'executor_and_fuser'
field in the JSON, which is now replaced with two separate fields,
'executor' and 'fuser'.
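A hypothetical sketch of the consumer-side change (the actual `upload_scribe.py` schema may differ):
```python
def executor_and_fuser(record):
    # The combined 'executor_and_fuser' field is gone; read the two separate
    # fields instead, with defensive defaults.
    return record.get("executor", "unknown"), record.get("fuser", "unknown")
```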
Differential Revision: D23561500
Test Plan: Imported from OSS
Reviewed By: ngimel
Pulled By: ZolotukhinM
fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647
Nothing fancy, just a basic implementation of the graph executor without using a stack machine.
Reviewed By: bwasti
Differential Revision: D23208413
fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603
We are in the midst of landing a big rework of the profiling executor, and
benchmarks are expected to fail while we are in this transitional state.
Test Plan: Imported from OSS
Reviewed By: SplitInfinity
Differential Revision: D23334818
Pulled By: ZolotukhinM
fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
Summary:
Reland of the benchmark code that broke the slow tests because the GPUs were running out of memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428
Reviewed By: ngimel
Differential Revision: D23296136
Pulled By: albanD
fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956
In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for a better representation of real workloads (see the sketch below)
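A rough sketch of what such a config might look like in the operator_benchmark harness (shapes and names are illustrative, not the committed values):
```python
import operator_benchmark as op_bench

# Illustrative config: histogram observers are restricted to CPU (CUDA is too
# slow to benchmark), and larger shapes better represent real workloads.
qobserver_configs = op_bench.cross_product_configs(
    C=[3, 64],
    M=[256, 1024],
    N=[256, 1024],
    device=['cpu'],   # no 'cuda' for histogram observers
    tags=['short'],
)
```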
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```
Imported from OSS
Reviewed By: supriyar
Differential Revision: D23093996
fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018
This diff fixes an issue where the original non-learnable fake quantize was provided with a trainable scale and zero point; `requires_grad` should be completely disabled for both parameters.
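A minimal sketch of the intended behavior (hypothetical tensors, not the actual FakeQuantize module code):
```python
import torch

# Non-learnable fake quantize: scale and zero_point are fixed tensors with
# gradients disabled, rather than trainable parameters.
scale = torch.tensor(0.1, requires_grad=False)
zero_point = torch.tensor(0.0, requires_grad=False)

x = torch.randn(4, 4, requires_grad=True)
y = torch.fake_quantize_per_tensor_affine(x, scale.item(), int(zero_point.item()), 0, 255)
y.sum().backward()   # gradients flow to x only; scale and zero_point stay fixed
```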
Test Plan:
Use the following command to execute the benchmark test:
`buck test mode/dev-nosan pt:quantization_test`
Reviewed By: vkuzo
Differential Revision: D23107846
fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946
There are 3 options for the executor and fuser, and some of them aren't
super interesting, so I've combined the options into a single parameter, but
made it fairly easy to expand the set if there are other configs we might care
about.
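A rough sketch of expanding one combined setting into the underlying JIT switches (the mode names here are illustrative; the `torch._C` toggles are real but private APIs):
```python
import torch

def set_execution_mode(mode):
    # Illustrative mapping from a single combined benchmark parameter to the
    # separate executor/fuser switches.
    if mode == "legacy-old":          # legacy executor + old fuser
        torch._C._jit_set_profiling_executor(False)
        torch._C._jit_set_texpr_fuser_enabled(False)
    elif mode == "profiling-te":      # profiling executor + tensorexpr fuser
        torch._C._jit_set_profiling_executor(True)
        torch._C._jit_set_texpr_fuser_enabled(True)
    else:
        raise ValueError(f"unknown mode: {mode}")
```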
Test Plan:
Benchmark it
Imported from OSS
Reviewed By: zheng-xq
Differential Revision: D23090177
fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767
Same as the previous PR: forces the qlinear benchmark to follow the fp one.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23013937
fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761
Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23012533
fbshipit-source-id: af30ee585389395569a6322f5210828432963077