Commit Graph

317 Commits

Shijun Kong
6ae0a7c919 Add ReplaceNaN benchmark as baseline (#46685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46685

as title

Test Plan:
caffe2

```
./buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/replace_nan_test.par

# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: replace_nan
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1022 10:09:48.508246 1887813 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: replace_nan_M16_N16_dtypefloat
# Input: M: 16, N: 16, dtype: float
Forward Execution Time (us) : 30.742

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M16_N16_dtypedouble
# Input: M: 16, N: 16, dtype: double
Forward Execution Time (us) : 29.135

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypefloat
# Input: M: 64, N: 64, dtype: float
Forward Execution Time (us) : 94.059

# Benchmarking Caffe2: replace_nan
# Name: replace_nan_M64_N64_dtypedouble
# Input: M: 64, N: 64, dtype: double
Forward Execution Time (us) : 93.569
```

Reviewed By: qizzzh, houseroad

Differential Revision: D24448483

fbshipit-source-id: 51574ca0eca6dba5828dfdc754193dba5a62954f
2020-10-22 19:12:14 -07:00
Yang Wang
920ec6651f [OpBench] fix jit mode run of operator benchmark for ops with parameters (#46694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46694

For ops with parameters (e.g. conv), a jit mode run currently raises an error:
`RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient`. After consulting https://www.fburl.com/vtkys6ug, I decided to turn off gradients for the parameters in the forward run. If we want ops with parameters to work in backward with jit mode, we probably need to make `TorchBenchmarkBase` a subclass of `nn.Module`.
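
A minimal repro/fix sketch of the underlying issue (illustrative, not the diff itself): a tensor that requires grad gets captured as a constant when a function closing over it is scripted, and detaching it avoids the error.

```python
import torch

weight = torch.randn(4, 4, requires_grad=True)
weight = weight.detach()  # without this, torch.jit.script raises
                          # "Cannot insert a Tensor that requires grad as a constant"

@torch.jit.script
def forward(x):
    return x @ weight  # weight is captured as a constant in the scripted graph

print(forward(torch.randn(2, 4)).shape)
```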

Test Plan: ./buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par  --use_jit

Reviewed By: mingzhe09088

Differential Revision: D24451206

fbshipit-source-id: 784eb60ca155b0152d745c92f6d0ce6b2c9014c6
2020-10-22 11:10:28 -07:00
Shijun Kong
e5a2ba2ea1 Fix benchmark_caffe2
Summary: benchmark_caffe2 is broken due to a refactor that changed eager test generation to registration-only.

Test Plan:
`buck run caffe2/benchmarks/operator_benchmark/c2:add_test`

```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking Caffe2: add
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1021 08:07:06.350742 390665 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: add_M8_N16_K32_dtypeint
# Input: M: 8, N: 16, K: 32, dtype: int
Forward Execution Time (us) : 652.748

# Benchmarking Caffe2: add
# Name: add_M16_N16_K64_dtypefloat
# Input: M: 16, N: 16, K: 64, dtype: float
Forward Execution Time (us) : 63.570

# Benchmarking Caffe2: add
# Name: add_M64_N64_K128_dtypeint
# Input: M: 64, N: 64, K: 128, dtype: in
```

Reviewed By: qizzzh

Differential Revision: D24448374

fbshipit-source-id: 850fd375d194c20c385ea4433aea13066c7476e6
2020-10-22 08:09:06 -07:00
Mingzhe Li
8908f6ad8e [op-bench] modify import path of configs (#46679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46679

The current way of importing configs hits a runtime error when a single benchmark is launched directly with buck (e.g. `/buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/conv_test.par`). The diff fixes that issue.
ghstack-source-id: 114857978
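
The exact change isn't shown here; a common pattern for making configs importable both from the benchmark package and from a standalone .par launch looks like this (a hypothetical sketch; the module names are illustrative):

```python
# tolerate both package-style and flat launch layouts
try:
    from pt import configs  # running from the operator_benchmark package
except ImportError:
    import configs  # launched directly, e.g. conv_test.par
```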

Test Plan: waitforsandcastle

Reviewed By: vkuzo

Differential Revision: D24459631

fbshipit-source-id: 29df17e66962a8604dbb7b8b9106713c3c19bed5
2020-10-21 16:15:11 -07:00
Hao Lu
1a3ea46dbf [StaticRuntime] Threading model (#46219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46219

- Refactor StaticRuntime and group common data structures, the jit graph, and the script module into a separate struct `InferenceModule`:
```
struct InferenceModule {
  explicit InferenceModule(const torch::jit::Module& m);
  explicit InferenceModule(std::shared_ptr<torch::jit::Graph> g);
  torch::jit::Module module;
  std::shared_ptr<torch::jit::Graph> graph;
  std::unique_ptr<c10::FunctionSchema> schema;

  std::unordered_map<Value*, size_t> value_to_reg;
  std::vector<size_t> input_regs; // inputs to the graph
  std::vector<size_t> output_regs; // outputs of the graph
  std::vector<size_t> internals;
};
```
which is stored in the PyTorchPredictor, as well as the static runtime, and shared across threads. Then this is what's left inside the Static Runtime:
```
  mutable std::vector<IValue> reg_;
  // The nodes we need to run
  std::vector<ProcessedNode> nodes_;
```
`reg_` holds all the weights and activations, which differ across threads while running. `nodes_` holds the op nodes and input/output registers, and is the same across threads for now. We could potentially put other stateful data structures in it, so I kept it inside the static runtime. It could easily be moved into `InferenceModule` if we decide not to put anything else into `ProcessedNode`.

- Added StaticRuntimeOptions so we can toggle certain optimizations on/off, for testing and benchmarking. `cleanup_activations` is an example.

- Integration with PyTorchPredictor. Added a lock-free stack in the PyTorchPredictor to hold all the static runtime instances. A benchmark shows that the `push`/`pop` combo takes about 80 ns, which is quite acceptable.

This diff focuses on threading model only. Benchmarks will be separate.
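
A rough Python analogue of that pooling pattern (the real implementation is a C++ lock-free stack inside PyTorchPredictor; all names below are illustrative):

```python
import queue

class RuntimePool:
    """Each caller pops its own runtime instance and pushes it back after use."""

    def __init__(self, make_runtime, size):
        self._stack = queue.LifoQueue()
        for _ in range(size):
            self._stack.put(make_runtime())

    def run(self, *args):
        rt = self._stack.get()       # analogue of the lock-free pop
        try:
            return rt(*args)
        finally:
            self._stack.put(rt)      # analogue of the lock-free push
```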

Reviewed By: bwasti

Differential Revision: D24237078

fbshipit-source-id: fd0d6347f02b4526ac17dec1f731db48424bade1
2020-10-20 14:37:30 -07:00
Mikhail Zolotukhin
e5ed037529 [StaticRuntime] Add a 'speed of light' benchmark. (#46308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46308

This PR adds a hand-optimized version of the DeepAndWide model with the goal
of estimating the overheads of static runtime. While static runtime is
currently much faster than the existing JIT interpreter, it would be
useful to understand how close we are to an absolutely zero-overhead
system. Currently, this "ideal" implementation is 2x faster than the
static runtime at batchsize=1.

Full benchmark results:
```
Running build/bin/static_runtime_bench
Run on (24 X 2394.71 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 4096K (x24)
  L3 Unified 16384K (x24)
------------------------------------------------------------------------------
Benchmark                                       Time           CPU Iterations
------------------------------------------------------------------------------
BM_deep_wide_base/1                         59518 ns      59500 ns      10909
BM_deep_wide_base/8                         74635 ns      74632 ns       9317
BM_deep_wide_base/20                        82186 ns      82147 ns       9119
BM_deep_wide_fast/1                         13851 ns      13851 ns      49825 << new
BM_deep_wide_fast/8                         22497 ns      22497 ns      32089 << new
BM_deep_wide_fast/20                        23868 ns      23841 ns      31184 << new
BM_deep_wide_jit_graph_executor/1           62786 ns      62786 ns      10835
BM_deep_wide_jit_graph_executor/8           76730 ns      76718 ns       7529
BM_deep_wide_jit_graph_executor/20          78886 ns      78883 ns       8769
BM_deep_wide_jit_profiling_executor/1       69504 ns      69490 ns      10309
BM_deep_wide_jit_profiling_executor/8       75718 ns      75715 ns       9199
BM_deep_wide_jit_profiling_executor/20      75364 ns      75364 ns       9010
BM_deep_wide_static/1                       40324 ns      40318 ns      17232
BM_deep_wide_static/8                       50327 ns      50319 ns      13335
BM_deep_wide_static/20                      53075 ns      53071 ns      12855
BM_deep_wide_static_threaded/threads:8       6258 ns      49873 ns      14008
```

PS: The implementation could probably be optimized even more.

Differential Revision: D24300702

Test Plan: Imported from OSS

Reviewed By: dzhulgakov

Pulled By: ZolotukhinM

fbshipit-source-id: 7870bdef127c39d11bcaa4f03a60eb80a46be58e
2020-10-19 23:35:55 -07:00
Bugra Akyildiz
03c7d5be6b Add operator benchmark for 4bit/8bit embedding lookups
Summary: Add operator benchmark for 4bit/8bit embedding lookups in `aibench`.

Test Plan:
```
buck build //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test
aibench-cli adhoc -c 'buck run //caffe2/benchmarks/operator_benchmark/pt:qembedding_bag_lookups_test'
```

The run was successful in aibench: https://www.internalfb.com/intern/aibench/details/738300474
https://www.internalfb.com/intern/aibench/details/346463246

Reviewed By: radkris-git

Differential Revision: D24268413

fbshipit-source-id: 7fb4ff75da47f8f327edab562c5d29bb69e00b8d
2020-10-15 13:51:32 -07:00
Bert Maher
b7261de0df [pytorch][te] Add compilation time benchmark (#46124)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46124

We want to make sure we can actually fuse kernels within a fairly
tight time budget.  So here's a quick benchmark of codegen for a simple
pointwise activation function (swish).  I kept all the intermediate tensors
separate to force TE to actually do inlining.
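
For reference, a rough Python sketch of the kind of pointwise graph being compiled; spelling sigmoid out as separate intermediates (rather than one fused call) is what forces the fuser to do the inlining. The exact decomposition in the benchmark may differ.

```python
import torch

@torch.jit.script
def swish(x):
    # each step is a separate node so the fuser must inline the intermediates
    neg = -x
    e = torch.exp(neg)
    one = torch.ones_like(x)
    sig = one / (one + e)   # sigmoid, written out
    return x * sig          # swish(x) = x * sigmoid(x)
```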

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
```

I've only run in debug mode so the results aren't super meaningful, but even in
that mode it's 18 ms for compilation, 15 ms of which are in LLVM.

Update, opt build mode:
```
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_CompileSwish                         5123276 ns    5119846 ns        148
BM_CompileSwishLLVMOnly                 4754361 ns    4753701 ns        160
```

Reviewed By: asuhan

Differential Revision: D24232801

fbshipit-source-id: d58a8b7f79bcd9244c49366af7a693e09f24bf76
2020-10-09 23:11:37 -07:00
shmsong
43fe45ab0f [JIT] Add dynamic shape benchmark for NV Fuser (#46107)
Summary:
This PR modifies `benchmarks/tensorexpr`. It follows up https://github.com/pytorch/pytorch/pull/44101 and further supports characterizing fusers with dynamic shape benchmarks. The dynamic shape condition models the use case where the input tensor shape changes on each call to the graph.

Changes include:

Added an auxiliary class `DynamicShape` that provides a simple API for enabling dynamic shapes in existing test cases; an example can be found in `DynamicSimpleElementBench`.

Created new `bench_cls` entries: `DynamicSimpleElementBench`, `DynamicReduce2DInnerBench`, `DynamicReduce2DOuterBench`, and `DynamicLSTM`. They are all dynamically shaped versions of existing benchmarks and examples of enabling dynamic shapes with `DynamicShape` (a sketch of the idea follows).
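
A hedged sketch of the `DynamicShape` idea (method names are illustrative, not necessarily the class's actual API): draw a fresh random input shape on every iteration so each call to the graph sees different sizes.

```python
import random

class DynamicShape:
    """Mixin sketch: draw fresh input sizes on each call."""

    def __init__(self, max_dim=1024):
        self.max_dim = max_dim

    def rand_shape(self, rank):
        return tuple(random.randint(1, self.max_dim) for _ in range(rank))

# a dynamic elementwise bench would rebuild its input each iteration, e.g.
# torch.rand(*self.rand_shape(2)), instead of reusing one fixed-size tensor
```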

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46107

Reviewed By: glaringlee

Differential Revision: D24229400

Pulled By: bertmaher

fbshipit-source-id: 889fece5ea87d0f6f6374d31dbe11b1cd1380683
2020-10-09 22:09:21 -07:00
Supriya Rao
31888b2e77 [quant][pyper] Rename the sparse argument for embedding_bag ops (#46003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46003

`sparse` is confusing because it is used in training for sparse gradients

Test Plan: Imported from OSS

Reviewed By: radkris-git, qizzzh

Differential Revision: D24178248

fbshipit-source-id: 0a2b595f3873d33b2ce25839b6eee31d2bfd3b0d
2020-10-08 16:15:28 -07:00
Shijun Kong
7d4f5060ad Fix doc about operator benchmark (#45853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45853

The method name in the README is not consistent with the actual implementation.

Reviewed By: qizzzh

Differential Revision: D24114849

fbshipit-source-id: d979e324c768708e99b8cc5b87e261f17c22a883
2020-10-08 09:13:53 -07:00
Bert Maher
f2e569461b [te] Tiled (m=32 x n=32) gemm benchmark (#45905)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45905

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142402

Pulled By: bertmaher

fbshipit-source-id: b39e18b6985ee1c1f654fba4498ed91ff14d8d5f
2020-10-06 16:57:31 -07:00
Bert Maher
50f89578dd [te] Add a benchmark harness (#45875)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45875

Adds a googlebenchmark harness for perf testing programs generated by
tensorexpr, sans any pytorch wrappings (for python-level benchmarks of
tensorexpr, see benchmarks/tensorexpr).

Currently there's a harness for gemm that sets up the problem using torch (and
also measures the perf of a torch::mm to give a baseline).

Right now there's just an unoptimized implementation that isn't expected to be
very fast. More optimized versions are coming.

Sample output from my dev box:
```
Run on (48 X 2501 MHz CPU s)
CPU Caches:
  L1 Data 32K (x24)
  L1 Instruction 32K (x24)
  L2 Unified 256K (x24)
  L3 Unified 30720K (x2)
--------------------------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
Gemm/Torch/128/128/128                    73405 ns      73403 ns       8614 GFLOPS=57.1411G/s
Gemm/TensorExprNoopt/128/128/128        3073003 ns    3072808 ns        229 GFLOPS=1.36497G/s
```

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D24142403

Pulled By: bertmaher

fbshipit-source-id: 3354aaa56868a43a553acd1ad9a192f28d8e3597
2020-10-06 16:57:27 -07:00
Mingzhe Li
e829d4fba9 [op-bench] fix jit mode (#45774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45774

Fix RuntimeError: No such operator operator_benchmark::_consume

Test Plan: waitforsandcastle

Reviewed By: ngimel

Differential Revision: D24064982

fbshipit-source-id: 13160b6d18569e659ca1ab0ca1d444ed9947260c
2020-10-05 09:29:41 -07:00
Hao Lu
2b48dd168d [StaticRuntime] Integrate Static Runtime into PyTorchPredictor (#45640)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45640

Reviewed By: dzhulgakov

Differential Revision: D23996656

fbshipit-source-id: 63d88c89d1df61a04deadc472319607ed83867e5
2020-10-02 23:03:05 -07:00
Ilia Cherniavskii
f5c95d5cf1 Source code level attribution in profiler (#43898)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43898

Adds a `with_source` parameter to enable tracking source code locations
(filename and line) in the profiler for eager, TorchScript, and autograd
modes.
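
A minimal usage sketch, taking the `with_source` flag name from this diff (the flag and the exact reporting API are assumptions here and may differ from what finally lands):

```python
import torch

x = torch.randn(10, 10)
with torch.autograd.profiler.profile(with_source=True) as prof:  # flag per this diff
    y = (x + x).sum()
print(prof.table(sort_by="self_cpu_time_total"))  # should include a Source Location column
```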

Test Plan:
python test/test_profiler.py
```
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Source Location
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------------------------------------
ts_method_1                          10.43%           235.364us        36.46%           822.920us        822.920us        1                test/test_profiler.py(70): test_source
aten::add                            7.52%            169.833us        8.88%            200.439us        200.439us        1                test/test_profiler.py(69): test_source
aten::normal_                        6.26%            141.380us        6.26%            141.380us        141.380us        1                test/test_profiler.py(67): test_source
aten::add                            5.80%            130.830us        8.41%            189.800us        63.267us         3                test/test_profiler.py(72): test_source
aten::sum                            5.02%            113.340us        8.39%            189.475us        189.475us        1                test/test_profiler.py(64): ts_method_1
aten::add                            4.58%            103.346us        6.33%            142.847us        142.847us        1                test/test_profiler.py(62): ts_method_1
aten::mul                            4.05%            91.498us         9.62%            217.113us        217.113us        1                test/test_profiler.py(71): test_source
aten::add                            4.03%            90.880us         5.60%            126.405us        126.405us        1                test/test_profiler.py(58): ts_method_2
aten::empty                          3.49%            78.735us         3.49%            78.735us         19.684us         4                test/test_profiler.py(72): test_source
```

Reviewed By: ngimel

Differential Revision: D23432664

Pulled By: ilia-cher

fbshipit-source-id: 83ad7ebe0c2502494d3b48c4e687802db9c77615
2020-09-30 00:57:35 -07:00
Taylor Robie
ccad73ab41 Fix D23995953 import.
Summary: https://github.com/pytorch/pytorch/pull/45511 could not be properly imported

Test Plan: See https://github.com/pytorch/pytorch/pull/45511

Reviewed By: zhangguanheng66

Differential Revision: D23995953

fbshipit-source-id: a6224a67d54617ddf34c2392e65f2142c4e78ea4
2020-09-29 19:30:23 -07:00
Bram Wasti
87b356d093 [static runtime] Split out graph preparation from runtime (#44131)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44131

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604305

Pulled By: bwasti

fbshipit-source-id: 7b47da4961d99074199417ef1407a788c7d80ee6
2020-09-28 13:01:23 -07:00
Mikhail Zolotukhin
bc5710f2f7 Benchmarks: tweak PE config settings. (#45349)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45349

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935518

Pulled By: ZolotukhinM

fbshipit-source-id: 5a7c508c6fc84eafbc23399f095d732b903510dc
2020-09-26 23:13:29 -07:00
Mikhail Zolotukhin
8cef7326f4 Benchmarks: add 'default' options for fuser and executor. (#45347)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45347

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D23935519

Pulled By: ZolotukhinM

fbshipit-source-id: 8323fafe7828683c4d29c12a1e5722adb6f945ff
2020-09-26 23:09:02 -07:00
Bram Wasti
e5f6e5af13 Add Deep and wide to test and flatten/tranpose for good measure (#44129)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44129

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604302

Pulled By: bwasti

fbshipit-source-id: 5787f6f32a80b22b1b712c4116f70370dad98f12
2020-09-25 11:05:41 -07:00
Bram Wasti
d1a11618f5 [static runtime] Add _out variants and reuse memory (#44128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44128

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D23604304

Pulled By: bwasti

fbshipit-source-id: 06a23cb75700a0fc733069071843b7b498e7b9e9
2020-09-25 11:03:06 -07:00
anjali411
58b6ab69e5 torch.sgn for complex tensors (#39955)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39955

resolves https://github.com/pytorch/pytorch/issues/36323 by adding `torch.sgn` for complex tensors.
`torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x==0`
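
Concretely, per that definition (expected values computed by hand from `x/abs(x)`):

```python
import torch

z = torch.tensor([3 + 4j, 0j, -2j])
print(torch.sgn(z))  # approximately [0.6+0.8j, 0j, -1j]
```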

This PR doesn't test the correctness of the gradients. That will be done as part of auditing all the ops in the future, once we decide on the autograd behavior (JAX vs TF) and add gradcheck.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23460526

Pulled By: anjali411

fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
2020-09-22 08:24:53 -07:00
Xiang Gao
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
Kevin Stephano
26a91a9f04 [WIP][JIT] Add benchmarking support of NV Fuser with FP16 dtype support (#44101)
Summary:
Modified files in `benchmarks/tensorexpr` to add support for NVIDIA's Fuser for the jit compiler.

This support has some modifications besides adding an option to support the NVIDIA fuser:

* Adds FP16 Datatype support
* Fixes SOL/Algo calculations to generally use the data type instead of being fixed to 4 bytes
* Adds IR printing and kernel printing knobs
* Adds a knob `input_iter` to create ranges of inputs, currently only for reductions
* Adds further reduction support for inner- and outer-dimension reductions that are compatible with the `input_iter` knob.
* Added `simple_element`, `reduce2d_inner`, and `reduce2d_outer` to isolate performance on elementwise and reduction operations in the most minimal fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44101

Reviewed By: ngimel

Differential Revision: D23713658

Pulled By: bertmaher

fbshipit-source-id: d6b83cfab559aefe107c23b3c0f2df9923b3adc1
2020-09-15 15:10:49 -07:00
Mikhail Zolotukhin
37093f4d99 Benchmarks: make fuser and executor configurable from command line. (#44291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23569089

Pulled By: ZolotukhinM

fbshipit-source-id: ec25b2f0bba303adaa46c3e85b1a9ce4fa3cf076
2020-09-09 11:59:35 -07:00
Mikhail Zolotukhin
6134ac17ba Revert D23561500: Benchmarks: re-enable profiling-te configuration (try 2).
Test Plan: revert-hammer

Differential Revision:
D23561500 (589a2024c8)

Original commit changeset: 7fe86d34afa4

fbshipit-source-id: 10e48f230402572fcece56662ad4413ac0bd3cb5
2020-09-07 19:10:30 -07:00
Mikhail Zolotukhin
589a2024c8 Benchmarks: re-enable profiling-te configuration (try 2). (#44270)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44270

The previous PR (#44212) was reverted since I didn't update the
`upload_scribe.py` script: it was looking for an 'executor_and_fuser'
field in the JSON, which is now replaced with two separate fields,
'executor' and 'fuser'.

Differential Revision: D23561500

Test Plan: Imported from OSS

Reviewed By: ngimel

Pulled By: ZolotukhinM

fbshipit-source-id: 7fe86d34afa488a0e43d5ea2aaa7bc382337f470
2020-09-07 15:50:39 -07:00
Natalia Gimelshein
626e410e1d Revert D23544563: Benchmarks: re-enable profiling-te configuration.
Test Plan: revert-hammer

Differential Revision:
D23544563 (ac1f471fe2)

Original commit changeset: 98659e8860fa

fbshipit-source-id: 5dab7044699f59c709e64d178758f5f462ebb788
2020-09-06 21:01:19 -07:00
Mikhail Zolotukhin
ac1f471fe2 Benchmarks: re-enable profiling-te configuration. (#44212)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44212

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23544563

Pulled By: ZolotukhinM

fbshipit-source-id: 98659e8860fa951d142e0f393731c4a769463c6c
2020-09-06 10:22:16 -07:00
Mikhail Zolotukhin
d0421ff1cc Benchmarks: add scripts for FastRNNs results comparison. (#44134)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44134

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23505810

Pulled By: ZolotukhinM

fbshipit-source-id: d0b3d70d4c2a44a8c3773631d09a25a98ec59370
2020-09-03 13:44:42 -07:00
Mikhail Zolotukhin
d11603de38 [TensorExpr] Benchmarks: set number of profiling runs to 2 for PE. (#44112)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44112

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D23500904

Pulled By: ZolotukhinM

fbshipit-source-id: d0dd54752b7ea5ae11f33e865c96d2d61e98d573
2020-09-03 11:29:35 -07:00
Bert Maher
33d51a9b32 Respect canFuseOn{CPU,GPU} in TE fuser (#43967)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43967

Test Plan: Imported from OSS

Reviewed By: asuhan

Differential Revision: D23469048

Pulled By: bertmaher

fbshipit-source-id: 1005a7ae08974059ff9d467492caa3a388070eeb
2020-09-02 18:00:25 -07:00
taivu
8722952dbd Add benchmark for channel_shuffle operator (#43509)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43509

Test Plan: Imported from OSS

Reviewed By: kimishpatel

Differential Revision: D23299972

Pulled By: kimishpatel

fbshipit-source-id: 6189d209859da5a41067eb9e8317e3bf7a0fc754
2020-09-02 08:15:19 -07:00
Bram Wasti
6512032699 [Static Runtime] Add OSS build for static runtime benchmarks (#43881)
Summary:
Adds CMake option.  Build with:

```
BUILD_STATIC_RUNTIME_BENCHMARK=ON python setup.py install
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43881

Reviewed By: hlu1

Differential Revision: D23430708

Pulled By: bwasti

fbshipit-source-id: a39bf54e8d4d044a4a3e4273a5b9a887daa033ec
2020-09-02 08:00:18 -07:00
Hao Lu
8538a79bfe [jit][static] Basic executor (#43647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43647

Nothing fancy, just a basic implementation of the graph executor without using a stack machine.

Reviewed By: bwasti

Differential Revision: D23208413

fbshipit-source-id: e483bb6ad7ba8591bbe1767e669654d82f42c356
2020-08-28 23:20:07 -07:00
Mikhail Zolotukhin
c1553ff94b Benchmarks: temporarily disable profiling-te configuration. (#43603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43603

We are in the midst of landing a big rework of the profiling executor, and
benchmarks are expected to fail while we are in the transitional state.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D23334818

Pulled By: ZolotukhinM

fbshipit-source-id: 99ff17c6f8ee18d003f6ee76ff0e719cea68c170
2020-08-25 21:00:10 -07:00
albanD
e08e93f946 Reland of benchmark code (#43428)
Summary:
Reland of the benchmark code that broke the slow tests because the GPUs were running out of memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43428

Reviewed By: ngimel

Differential Revision: D23296136

Pulled By: albanD

fbshipit-source-id: 0002ae23dc82f401604e33d0905d6b9eedebc851
2020-08-24 13:27:26 -07:00
Supriya Rao
7024ce8a2c [quant] Add benchmarks for quantized embeddingbag module (#43296)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43296

Use common config for float and quantized embedding_bag modules

Test Plan:
```
python -m pt.qembeddingbag_test

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 35.738

 Benchmarking PyTorch: qEmbeddingBag
 Mode: Eager
 Name: qEmbeddingBag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 62.708

python -m pt.embeddingbag_test

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetTrue_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: True, device: cpu
Forward Execution Time (us) : 46.878

 Benchmarking PyTorch: embeddingbag
 Mode: Eager
 Name: embeddingbag_embeddingbags10_dim4_modesum_input_size8_offset0_sparseTrue_include_last_offsetFalse_cpu
 Input: embeddingbags: 10, dim: 4, mode: sum, input_size: 8, offset: 0, sparse: True, include_last_offset: False, device: cpu
Forward Execution Time (us) : 103.904

```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D23245531

fbshipit-source-id: 81b44fde522238d3eef469434e93dd7f94b528a8
2020-08-24 09:51:03 -07:00
Alban Desmaison
74781ab5b8 Revert D23242101: [pytorch][PR] Implement first draft of autograd benchmark.
Test Plan: revert-hammer

Differential Revision:
D23242101 (c2511bdfa4)

Original commit changeset: a2b92d5a4341

fbshipit-source-id: bda562d15565f074b448022d180ec8f959c6ecc9
2020-08-21 12:22:57 -07:00
albanD
c2511bdfa4 Implement first draft of autograd benchmark. (#40586)
Summary:
It is quite a lot of code because I pulled some code from torchaudio and torchvision to work around issues I had getting their latest versions to work with PyTorch built from source, since I can't build those libs from source (a dependency is missing for torchaudio).

The compare script generates a table as follows:
| model | task | speedup | mean (before) | var (before) | mean (after) | var (after) |
| -- | -- | -- | -- | -- | -- | -- |
| resnet18 | vjp | 1.021151844124464 | 1.5627719163894653 | 0.005164200905710459 | 1.5304011106491089 | 0.003979875706136227 |
| resnet18 | vhp | 0.9919114430761606 | 6.8089728355407715 | 0.019538333639502525 | 6.86449670791626 | 0.014775685034692287 |
| resnet18 | jvp | 0.9715963084255123 | 5.720699310302734 | 0.08197150379419327 | 5.887938499450684 | 0.018408503383398056 |
| ppl_simple_reg | vjp | 0.9529183269165618 | 0.000362396240234375 | 7.526952949810095e-10 | 0.00038030146970413625 | 7.726220357939795e-11 |
| ppl_simple_reg | vhp | 0.9317708619586977 | 0.00048058031825348735 | 5.035701855504726e-10 | 0.0005157709238119423 | 3.250243477137538e-11 |
| ppl_simple_reg | jvp | 0.8609755877018406 | 0.00045447348384186625 | 9.646707044286273e-11 | 0.0005278587341308594 | 1.4493808930815533e-10 |
| ppl_simple_reg | hvp | 0.9764100147808232 | 0.0005881547695025802 | 7.618464747949361e-10 | 0.0006023645401000977 | 6.370915461850757e-10 |
| ppl_simple_reg | jacobian | 1.0019173715134297 | 0.0003612995205912739 | 2.2979899233499523e-11 | 0.0003606081008911133 | 1.2609764794835332e-11 |
| ppl_simple_reg | hessian | 1.0358429970264393 | 0.00206911563873291 | 2.590938796842579e-09 | 0.0019975185859948397 | 2.8916853356264482e-09 |
| ppl_robust_reg | vjp | 1.0669910916521521 | 0.0017304659122601151 | 3.1047047155396967e-09 | 0.0016218185191974044 | 4.926861585374809e-09 |
| ppl_robust_reg | vhp | 1.0181130455462972 | 0.0029563189018517733 | 2.6359153082466946e-08 | 0.0029037236236035824 | 1.020585038702393e-08 |
| ppl_robust_reg | jvp | 0.9818360373406179 | 0.0026934861671179533 | 6.981357714153091e-09 | 0.00274331565015018 | 3.589908459389335e-08 |
| ppl_robust_reg | hvp | 1.0270848910527002 | 0.005576515104621649 | 3.2798087801211295e-08 | 0.005429458804428577 | 6.438724398094564e-08 |
| ppl_robust_reg | jacobian | 1.0543611284155785 | 0.00167675013653934 | 2.3236829349571053e-08 | 0.001590299652889371 | 1.2011492245278532e-08 |
| ppl_robust_reg | hessian | 1.0535378727082656 | 0.01643357239663601 | 1.8450685956850066e-06 | 0.015598463825881481 | 2.1876705602608126e-07 |
| wav2letter | vjp | 1.0060408105086573 | 0.3516994118690491 | 1.4463969819189515e-05 | 0.349587619304657 | 9.897866402752697e-05 |
| wav2letter | vhp | 0.9873655295086051 | 1.1196287870407104 | 0.00474404776468873 | 1.133955717086792 | 0.009759620763361454 |
| wav2letter | jvp | 0.9741820317882822 | 0.7888165712356567 | 0.0017476462526246905 | 0.8097219467163086 | 0.0018235758179798722 |
| transfo | vjp | 0.9883954031921641 | 2.8865864276885986 | 0.008410997688770294 | 2.9204773902893066 | 0.006901870481669903 |
| transfo | vhp | 1.0111290842971339 | 8.374398231506348 | 0.014904373325407505 | 8.282224655151367 | 0.04449500888586044 |
| transfo | jvp | 1.0080534543381963 | 6.293097972869873 | 0.03796082362532616 | 6.24282169342041 | 0.010179692879319191 |
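
The task names match the `torch.autograd.functional` API (`vjp`, `vhp`, `jvp`, `hvp`, `jacobian`, `hessian`); a minimal sketch of the kind of timing being collected (the model and sizes here are illustrative):

```python
import time
import torch
from torch.autograd.functional import vjp

def f(x):
    return (x ** 2).sum()

x = torch.randn(1000)
v = torch.tensor(1.0)  # cotangent matching f's scalar output
samples = []
for _ in range(50):
    start = time.perf_counter()
    vjp(f, x, v)
    samples.append(time.perf_counter() - start)
print(f"mean {sum(samples) / len(samples):.6f}s")
```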

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40586

Reviewed By: pbelevich

Differential Revision: D23242101

Pulled By: albanD

fbshipit-source-id: a2b92d5a4341fe1472711a685ca425ec257d6384
2020-08-21 07:36:26 -07:00
Supriya Rao
4fc9e958c4 [quant] Add benchmarks for embedding_bag conversion ops (#43291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43291

Test Float2Fused and Fused2Float conversion operators for embedding_bag byte and 4-bit ops

Test Plan:
```
python -m pt.qembedding_pack_test
```

Imported from OSS

Reviewed By: radkris-git

Differential Revision: D23231641

fbshipit-source-id: a2afe51bba52980d2e96dfd7dbc183327e9349fd
2020-08-20 11:26:20 -07:00
Vasiliy Kuznetsov
5aa61afbfb quant bench: update observer configs (#42956)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42956

In preparation for observer perf improvement, cleans up the
micro benchmarks:
* disable CUDA for histogram observers (it's too slow)
* add larger shapes for better representation of real workloads
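
In operator_benchmark terms, the cleanup amounts to config changes along these lines (a sketch, not the actual diff; the attribute names and sizes are illustrative):

```python
import operator_benchmark as op_bench

# larger shapes to better represent real workloads; device restricted to CPU
# for histogram observers since their CUDA runs are too slow to be useful
qobserver_configs = op_bench.cross_product_configs(
    C=[3, 64, 512],
    M=[256, 1024],
    N=[256, 1024],
    device=['cpu'],
    tags=['short'],
)
```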

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qobserver_test
```

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23093996

fbshipit-source-id: 5dc477c9bd5490d79d85ff8537270cd25aca221a
2020-08-17 17:07:56 -07:00
Hao Lu
8864148823 [jit] DeepAndWide benchmark (#43096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43096

Add benchmark script for deep and wide model.

Reviewed By: bwasti, yinghai

Differential Revision: D23099925

fbshipit-source-id: aef09d8606eba1eccc0ed674dfea59b890d3648b
2020-08-15 01:27:12 -07:00
Paul Shao
8b5642a786 Fix to Learnable Fake Quantization Op Benchmarking (#43018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43018

In this diff, a fix is added for a case where the original non-learnable fake quantize was provided with a trainable scale and zero point; `requires_grad` for both parameters should be completely disabled.

Test Plan:
Use the following command to execute the benchmark test:

`buck test mode/dev-nosan pt:quantization_test`

Reviewed By: vkuzo

Differential Revision: D23107846

fbshipit-source-id: d2213983295f69121e9e6ae37c84d1f37d78ef39
2020-08-13 16:32:13 -07:00
Bert Maher
eb47940c0a Add executor and fuser options to the fastrnn test fixture (#42946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42946

There are 3 options for the executor and fuser, and some of them aren't
super interesting, so I've combined the options into a single parameter but
made it fairly easy to expand the set if there are other configs we might care
about (a sketch of the idea follows).
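
A hedged sketch of collapsing the executor/fuser cross-product into one knob (the JIT toggles below are real private APIs; the mapping and the names are illustrative):

```python
import torch

_CONFIGS = {
    "legacy": (False, False),      # legacy executor, TE fuser off
    "profiling_te": (True, True),  # profiling executor + tensorexpr fuser
}

def set_executor_and_fuser(name):
    profiling, te = _CONFIGS[name]
    torch._C._jit_set_profiling_executor(profiling)
    torch._C._jit_set_texpr_fuser_enabled(te)
```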

Test Plan:
Benchmark it

Imported from OSS

Reviewed By: zheng-xq

Differential Revision: D23090177

fbshipit-source-id: bd93a93c3fc64e5a4a847d1ce7f42ce0600a586e
2020-08-13 12:45:37 -07:00
Bert Maher
b8ae563ce6 Add a microbenchmark for LSTM elementwise portion (#42901)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42901

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079714

Pulled By: bertmaher

fbshipit-source-id: 28f8c3b5019ee898e82e64a0a674da1b4736d252
2020-08-12 17:11:47 -07:00
Bert Maher
33d209b5f4 Fix TE microbenchmark harness to use appropriate fuser/executor (#42900)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42900

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23079715

Pulled By: bertmaher

fbshipit-source-id: 6aa2b08a550835b7737e355960a16a7ca83878ea
2020-08-12 17:11:44 -07:00
Vasiliy Kuznetsov
57b056b5f2 align qlinear benchmark to linear benchmark (#42767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42767

Same as previous PR, forcing the qlinear benchmark to follow the fp one

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23013937

fbshipit-source-id: fffaa7cfbfb63cea41883fd4d70cd3f08120aaf8
2020-08-11 10:35:16 -07:00
Vasiliy Kuznetsov
a7bdf575cb align qconv benchmark to conv benchmark (#42761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42761

Makes the qconv benchmark follow the conv benchmark exactly. This way
it will be easy to compare q vs fp with the same settings.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test
python -m pt.conv_test
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23012533

fbshipit-source-id: af30ee585389395569a6322f5210828432963077
2020-08-11 10:33:19 -07:00