albanD
9920ae665b
Make te a hidden package for now ( #51690 )
...
Summary:
As discussed with suo, having it in `torch._C.XX` means that it automatically gets added to `torch.XX`, which is unfortunate. Making it `torch._C._XX` means that it won't be added to `torch.`.
Let me know if that approach to hide it is not good and we can update that.
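A minimal sketch of why the underscore prefix hides the binding, assuming the top-level package re-exports every name from the extension module that does not start with `_`. `fake_C`, `fake_torch`, and `reexport_public` are hypothetical stand-ins for illustration, not PyTorch code:

```python
import types


def reexport_public(source, target):
    """Copy every name from `source` that does not start with '_' onto `target`."""
    copied = []
    for name in dir(source):
        if name.startswith("_"):
            continue  # private names (and dunders) are never re-exported
        setattr(target, name, getattr(source, name))
        copied.append(name)
    return copied


# Hypothetical stand-ins for the extension module and the top-level package.
fake_C = types.ModuleType("fake_C")
fake_C.te = "public binding"
fake_C._te = "hidden binding"
fake_torch = types.ModuleType("fake_torch")

copied = reexport_public(fake_C, fake_torch)
print(copied)
```

Only `te` is copied; `_te` stays reachable through `fake_C` alone.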
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51690
Reviewed By: gchanan
Differential Revision: D26243207
Pulled By: albanD
fbshipit-source-id: 3eb91a96635e90a6b98df799e3a732833dd280d5
2021-02-04 07:58:38 -08:00
Bert Maher
a23e82df10
[nnc] Tweak log_nnc_sleef so vectorization kicks in ( #51491 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51491
The vectorizer heuristic is pretty dumb and only kicks in if the
unroll factor is exactly 8 or 4.
It's still slower than direct implementation, which isn't surprising.
ghstack-source-id: 120783426
Test Plan:
`buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench`
Before:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 438 ns 438 ns 1795511 log/s=146.259M/s
log_nnc_sleef/512 3196 ns 3195 ns 210032 log/s=160.235M/s
log_nnc_sleef/8192 77467 ns 77466 ns 8859 log/s=105.749M/s
log_nnc_sleef/32768 310206 ns 310202 ns 2170 log/s=105.634M/s
log_nnc_fast/64 100 ns 100 ns 7281074 log/s=637.144M/s
log_nnc_fast/512 546 ns 546 ns 1335816 log/s=938.361M/s
log_nnc_fast/8192 7360 ns 7359 ns 91971 log/s=1.11316G/s
log_nnc_fast/32768 30793 ns 30792 ns 22633 log/s=1064.17M/s
log_aten/64 427 ns 427 ns 1634897 log/s=150.021M/s
log_aten/512 796 ns 796 ns 877318 log/s=643.566M/s
log_aten/8192 6690 ns 6690 ns 102649 log/s=1.22452G/s
log_aten/32768 25357 ns 25350 ns 27808 log/s=1.29263G/s
```
After:
```
---------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------
log_nnc_sleef/64 189 ns 188 ns 3872475 log/s=340.585M/s
log_nnc_sleef/512 1307 ns 1307 ns 557770 log/s=391.709M/s
log_nnc_sleef/8192 20259 ns 20257 ns 34240 log/s=404.404M/s
log_nnc_sleef/32768 81556 ns 81470 ns 8767 log/s=402.209M/s
log_nnc_fast/64 110 ns 110 ns 6564558 log/s=581.116M/s
log_nnc_fast/512 554 ns 554 ns 1279304 log/s=923.376M/s
log_nnc_fast/8192 7774 ns 7774 ns 91421 log/s=1053.75M/s
log_nnc_fast/32768 31008 ns 31006 ns 21279 log/s=1056.83M/s
```
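To put the two tables side by side, the speedup of `log_nnc_sleef` can be computed from the wall-clock times reported above. This is only arithmetic on the numbers already shown:

```python
# Wall-clock times in ns for log_nnc_sleef, copied from the Before/After tables.
before_ns = {64: 438, 512: 3196, 8192: 77467, 32768: 310206}
after_ns = {64: 189, 512: 1307, 8192: 20259, 32768: 81556}

speedup = {n: before_ns[n] / after_ns[n] for n in before_ns}
for n, s in speedup.items():
    print(f"log_nnc_sleef/{n}: {s:.2f}x faster")
```

The gain is roughly 2.3x for the smallest size and about 3.8x for the two largest.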
Reviewed By: bwasti
Differential Revision: D26139067
fbshipit-source-id: db31897ee9922695ff9dff4ff46e3d3fbd61f4c2
2021-02-01 16:35:37 -08:00
Marat Subkhankulov
721ba97eb6
Create op benchmark for stack ( #51263 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51263
- Add benchmark for stack op
Test Plan:
```
buck build mode/opt //caffe2/benchmarks/operator_benchmark/pt:stack_test --show-output
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/stack_test.par --tag_filter=static_runtime | grep Execution
Forward Execution Time (us) : 6.380
Forward Execution Time (us) : 6.553
Forward Execution Time (us) : 14.904
Forward Execution Time (us) : 5.657
Forward Execution Time (us) : 5.612
Forward Execution Time (us) : 6.051
Forward Execution Time (us) : 4.225
Forward Execution Time (us) : 4.240
Forward Execution Time (us) : 6.280
Forward Execution Time (us) : 6.267
Forward Execution Time (us) : 418.932
Forward Execution Time (us) : 417.694
Forward Execution Time (us) : 1592.455
Forward Execution Time (us) : 2919.261
Forward Execution Time (us) : 211.458
Forward Execution Time (us) : 211.518
Forward Execution Time (us) : 783.953
Forward Execution Time (us) : 1457.823
Forward Execution Time (us) : 2032.816
Forward Execution Time (us) : 2090.662
Forward Execution Time (us) : 6487.098
Forward Execution Time (us) : 11874.702
Forward Execution Time (us) : 2123.830
Forward Execution Time (us) : 2195.453
Forward Execution Time (us) : 6435.978
Forward Execution Time (us) : 11852.205
Forward Execution Time (us) : 2036.526
Forward Execution Time (us) : 2055.618
Forward Execution Time (us) : 6417.192
Forward Execution Time (us) : 12468.744
Forward Execution Time (us) : 4959.704
Forward Execution Time (us) : 5121.823
Forward Execution Time (us) : 5082.105
Forward Execution Time (us) : 5395.936
Forward Execution Time (us) : 5162.756
Forward Execution Time (us) : 23798.080
Forward Execution Time (us) : 4957.921
Forward Execution Time (us) : 4971.234
Forward Execution Time (us) : 5005.909
Forward Execution Time (us) : 5159.614
Forward Execution Time (us) : 5013.221
Forward Execution Time (us) : 20238.741
Forward Execution Time (us) : 7632.439
Forward Execution Time (us) : 7589.376
Forward Execution Time (us) : 7859.937
Forward Execution Time (us) : 8214.213
Forward Execution Time (us) : 11606.562
Forward Execution Time (us) : 34612.919
```
Reviewed By: hlu1
Differential Revision: D25859143
fbshipit-source-id: a1b735ce87f57b5eb67e223e549248a2cd7663c1
2021-01-30 10:32:14 -08:00
Hao Lu
11cda929fb
[StaticRuntime] Fix bug in MemoryPlanner ( #51342 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51342
There is a subtle bug with the MemoryPlanner with regard to view ops with out variant.
```
def forward(self, a: Tensor, shape: List[int]):
b = a.reshape(shape)
return b + b
```
In this case, if we replace reshape with the out variant, b would be managed by the MemoryPlanner and the storage of its output would have been set to nullptr right after inference by the MemoryPlanner if opts.cleanup_activations is true. Because b is a view of a, the storage of a is also set to nullptr, and this violates the API which promises that a is const.
To fix this bug, I changed the MemoryPlanner so that it puts b in the unmanaged part.
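A toy model of the aliasing problem, assuming nothing about Static Runtime internals. `Storage`, `ToyTensor`, and `cleanup` are hypothetical names that only mimic the relationship between a view, its base tensor, and a planner that frees managed outputs:

```python
class Storage:
    """Toy stand-in for a tensor's backing buffer."""

    def __init__(self, data):
        self.data = data


class ToyTensor:
    """Toy tensor: a handle onto a Storage. Views share the same Storage."""

    def __init__(self, storage):
        self.storage = storage

    def reshape_view(self):
        # A view reuses the input's storage instead of allocating a new one.
        return ToyTensor(self.storage)


def cleanup(managed):
    """Mimic cleanup_activations: drop the buffer behind each managed tensor."""
    for t in managed:
        t.storage.data = None


# Buggy plan: b is managed, so cleanup also wipes the buffer of input a.
a = ToyTensor(Storage([1.0, 2.0, 3.0, 4.0]))
b = a.reshape_view()
cleanup(managed=[b])
input_clobbered = a.storage.data is None

# Fixed plan: b is left unmanaged, so the input survives cleanup.
a2 = ToyTensor(Storage([1.0, 2.0, 3.0, 4.0]))
b2 = a2.reshape_view()
cleanup(managed=[])
input_preserved = a2.storage.data == [1.0, 2.0, 3.0, 4.0]
```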
Test Plan:
Add unit test to enforce the constness of inputs
```
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
Reviewed By: ajyu
Differential Revision: D26144203
fbshipit-source-id: 2dbacccf7685d0fe0f0b1195166e0510b2069fe3
2021-01-29 21:16:02 -08:00
Rohan Varma
5021582fe6
Fix benchmarks/distributed/ddp/benchmark.py ( #51095 )
...
Summary:
Fixes the issue reported in https://github.com/pytorch/pytorch/issues/50679 by using built-in object-based collectives. The user has verified that this patch works.
Test with:
RANK=0 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456
RANK=1 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51095
Reviewed By: SciPioneer
Differential Revision: D26070275
Pulled By: rohan-varma
fbshipit-source-id: 59abcaac9e395bcdd8a018bf6ba07521d94b2fdf
2021-01-29 11:10:13 -08:00
Pritam Damania
96cedefd8e
[Pipe] Refactor convert_to_balance under non-test package. ( #50860 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50860
Since fairscale.nn.Pipe still uses 'balance' and 'devices' parameters,
other frameworks like fairseq still use these parameters. As a result, the
`convert_to_balance` method is a nice utility to use for migrating to PyTorch
Pipe without changing a lot of code in other frameworks.
In addition to this I've renamed the method to be more illustrative of what it
does and also allowed an optional devices parameter.
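For readers unfamiliar with the 'balance' convention, here is a rough sketch of what such a list means: consecutive layers are grouped into partitions of the given sizes, one partition per device. `partition_by_balance` is a hypothetical helper operating on plain lists, not the utility this commit moves:

```python
from typing import List, Optional, Sequence, Tuple


def partition_by_balance(
    layers: Sequence[str],
    balance: List[int],
    devices: Optional[List[int]] = None,
) -> List[Tuple[int, List[str]]]:
    """Split `layers` into consecutive groups of sizes `balance`, one per device."""
    if sum(balance) != len(layers):
        raise ValueError("balance must sum to the number of layers")
    if devices is None:
        devices = list(range(len(balance)))  # default: device i for partition i
    if len(devices) < len(balance):
        raise ValueError("need at least one device per partition")
    partitions, start = [], 0
    for size, device in zip(balance, devices):
        partitions.append((device, list(layers[start:start + size])))
        start += size
    return partitions


plan = partition_by_balance(["embed", "block1", "block2", "head"], balance=[1, 2, 1])
print(plan)
```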
ghstack-source-id: 120430775
Test Plan:
1) waitforbuildbot
2) Tested with fairseq
Reviewed By: SciPioneer
Differential Revision: D25987273
fbshipit-source-id: dccd42cf1a74b08c876090d3a10a94911cc46dd8
2021-01-28 12:10:21 -08:00
Hao Lu
d035d56bfb
[StaticRuntime] Add out variant for reshape and flatten ( #51249 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51249
- Add out variants for reshape and flatten. reshape and flatten create tensor views only when they can; when they can't, they copy. The out variant reuses the TensorImpl in both cases. The difference is that the TensorImpl is a view in the first case and a normal TensorImpl in the second.
- Create a separate registry for the view ops with out variants. Because Tensor views can't participate in memory reuse (memonger), we need to track these ops separately.
- The MemoryPlanner does not track the StorageImpl of tensor views because they don't own the storage. However, in cases where reshape does not create a view, the MemoryPlanner does manage the output tensor.
Reviewed By: ajyu
Differential Revision: D25992202
fbshipit-source-id: dadd63b78088c129e491d78abaf8b33d8303ca0d
2021-01-27 22:44:11 -08:00
Vasiliy Kuznetsov
983b8e6b62
fake_quant: add a more memory efficient version ( #50561 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50561
Not for review yet, a bunch of TODOs need finalizing.
tl;dr: add an alternative implementation of `fake_quantize` which saves
a mask during the forward pass and uses it to calculate the backward.
There are two benefits:
1. the backward function no longer needs the input Tensor, and it can be
gc'ed earlier by autograd. On MobileNetV2, this reduces QAT overhead
by ~15% (TODO: link, and absolute numbers). We add an additional mask Tensor
to pass around, but its size is 4x smaller than the input tensor. A
future optimization would be to pack the mask bitwise and unpack in the
backward.
2. the computation of `qval` can be done only once in the forward and
reused in the backward. No perf change observed, TODO verify with better
metrics.
TODO: describe in more detail
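A pure-Python sketch of the idea, assuming the mask records which elements landed inside `[quant_min, quant_max]` before clamping and that the backward is a straight-through estimator. The function names are hypothetical and the real kernels operate on tensors, not lists:

```python
def fake_quant_forward(xs, scale, zero_point, quant_min, quant_max):
    """Return (fake-quantized values, mask of elements that were not clamped)."""
    out, mask = [], []
    for x in xs:
        q = round(x / scale) + zero_point       # unclamped integer level
        in_range = quant_min <= q <= quant_max
        q = min(max(q, quant_min), quant_max)   # clamp to the representable range
        out.append((q - zero_point) * scale)    # dequantize
        mask.append(in_range)
    return out, mask


def fake_quant_backward(grad_out, mask):
    """Pass gradients through only where the forward did not clamp."""
    return [g if m else 0.0 for g, m in zip(grad_out, mask)]


xs = [-3.0, -0.26, 0.1, 0.9, 5.0]
ys, mask = fake_quant_forward(xs, scale=0.5, zero_point=0, quant_min=-2, quant_max=2)
grads = fake_quant_backward([1.0] * len(xs), mask)
```

Note that `fake_quant_backward` takes only the incoming gradient and the mask, so the input `xs` does not need to be kept alive for the backward pass.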
Test Plan:
OSS / torchvision / MobileNetV2
```
python references/classification/train_quantization.py
--print-freq 1
--data-path /data/local/packages/ai-group.imagenet-256-smallest-side/prod/
--output-dir ~/nfs/pytorch_vision_tests/
--backend qnnpack
--epochs 5
TODO paste results here
```
TODO more
Imported from OSS
Reviewed By: ngimel
Differential Revision: D25918519
fbshipit-source-id: ec544ca063f984de0f765bf833f205c99d6c18b6
2021-01-27 19:36:04 -08:00
Mikhail Zolotukhin
e975169426
[TensorExpr] Redesign Tensor class. ( #50995 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50995
This change makes 'Tensor' a thin wrapper over 'Buf' and 'Stmt', and
merges it with recently introduced 'CompoundTensor'. A statement for the
tensor is either passed directly to the Tensor constructor (akin to
'CompoundTensor'), or is built immediately in constructor.
LoopNest is no longer responsible for constructing statements from
tensors - it simply stitches already constructed statements contained in
Tensors. This has a side effect that now we cannot construct several
loopnests from the same tensors - we need to explicitly clone statements
if we want to do that. A special copy constructor was added to LoopNest
to make it more convenient (note: this only affects tests, we don't
usually create multiple loopnests in other places).
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D26038223
Pulled By: ZolotukhinM
fbshipit-source-id: 27a2e5900437cfb0c151e8f89815edec53608e17
2021-01-27 16:14:22 -08:00
Nikita Shulga
97ea95ddd7
Delete tabs from becnh_approx.cpp ( #51157 )
...
Summary:
Introduced by D25981260 (f08464f31d )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51157
Reviewed By: bwasti
Differential Revision: D26090008
Pulled By: malfet
fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
2021-01-26 15:53:47 -08:00
Bert Maher
c4029444d1
[nnc] Per-operator benchmarks ( #51093 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51093
Operator level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels. We wouldn't normally see these in isolation, but
it points out where NNC is falling short (or doing well).
I threw in a composed hardswish for fun, because it's my favorite activation
function.
Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance. Fix incoming.
This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
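For reference, a "composed" hardswish is the activation written out as separate pointwise steps rather than a single fused op. A scalar sketch using the standard definition `x * relu6(x + 3) / 6`; the function names are illustrative, and the benchmark itself works on tensors:

```python
def relu6(x: float) -> float:
    """Clamp x to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hardswish_composed(x: float) -> float:
    """hardswish(x) = x * relu6(x + 3) / 6, as add, clamp, multiply, divide."""
    return x * relu6(x + 3.0) / 6.0


values = [hardswish_composed(x) for x in (-4.0, -3.0, 0.0, 3.0, 5.0)]
```

Each step is a simple elementwise operation, which is what makes the composed form a natural candidate for fusion into one kernel.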
ghstack-source-id: 120403675
Test Plan:
```
op eager nnc speedup
hardswish 0.187 0.051 3.70
hardswish 0.052 0.052 1.00
sigmoid 0.148 1.177 0.13
reciprocal 0.049 0.050 0.98
neg 0.038 0.037 1.02
relu 0.037 0.036 1.03
isnan 0.119 0.020 5.86
log 0.082 1.330 0.06
log10 0.148 1.848 0.08
log1p 0.204 1.413 0.14
log2 0.285 1.167 0.24
exp 0.063 1.123 0.06
expm1 0.402 1.417 0.28
erf 0.167 0.852 0.20
erfc 0.181 1.098 0.16
cos 0.124 0.793 0.16
sin 0.126 0.838 0.15
tan 0.285 1.777 0.16
acos 0.144 1.358 0.11
asin 0.126 1.193 0.11
cosh 0.384 1.761 0.22
sinh 0.390 2.279 0.17
atan 0.240 1.564 0.15
tanh 0.320 2.259 0.14
sqrt 0.043 0.069 0.63
rsqrt 0.118 0.117 1.01
abs 0.038 0.037 1.03
ceil 0.038 0.038 1.01
floor 0.039 0.039 1.00
round 0.039 0.292 0.13
trunc 0.040 0.036 1.12
lgamma 2.045 2.721 0.75
```
Reviewed By: zheng-xq
Differential Revision: D26069791
fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba
2021-01-26 14:10:08 -08:00
Bram Wasti
f08464f31d
[nnc] Add benchmarks
...
Summary: Adding a set of benchmarks for key operators
Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench
Reviewed By: ZolotukhinM
Differential Revision: D25981260
fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
2021-01-26 13:51:33 -08:00
Horace He
4cca08368b
Adds per-op microbenchmarks for NNC ( #50845 )
...
Summary:
Runs through the vast majority of primitive ops that exist in NNC and benchmarks them against PyTorch ops on CPU, then dumps out a plot of the results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50845
Reviewed By: ngimel
Differential Revision: D25989080
Pulled By: Chillee
fbshipit-source-id: 6d6a39eb06b3de9a999993224d5e718537c0c8c4
2021-01-21 13:21:01 -08:00
Xiaoqiang Zheng
b96a6516a6
Add CPP Full Reduction Benchmarks. ( #50193 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50193
* Supports aten, native reference implementation, and NNC TE implementations.
* Support functionality checks against aten, in addition to performance checks.
Test plans:
* After enabling "BUILD_TENSOREXPR_BENCHMARK" in CMakeLists.txt,
* bin/tensorexpr_bench --benchmark_filter=Reduce1D
Measurements:
On a Broadwell E5-2686 CPU,
Reduce1D/Torch/16777216 5638547 ns 5638444 ns 119 BYTES=11.902G/s
Reduce1D/Naive/16777216 19308235 ns 19308184 ns 36 BYTES=3.47567G/s
Reduce1D/NativeRfactor/16777216 8433348 ns 8433038 ns 85 BYTES=7.95785G/s
Reduce1D/NativeVector/16777216 5608836 ns 5608727 ns 124 BYTES=11.9651G/s
Reduce1D/NativeTiled/16777216 5550233 ns 5550221 ns 126 BYTES=12.0912G/s
Reduce1D/TeNaive/16777216 21451047 ns 21450752 ns 33 BYTES=3.12851G/s
Reduce1D/TeSplitTail/16777216 23701732 ns 23701229 ns 30 BYTES=2.83145G/s
Reduce1D/TeSplitMask/16777216 23683589 ns 23682978 ns 30 BYTES=2.83363G/s
Reduce1D/TeRfactorV2/16777216 5378019 ns 5377909 ns 131 BYTES=12.4786G/s
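The `BYTES` counter can be reproduced from the table, assuming (this is not stated above) that the input is float32, i.e. 4 bytes per element, and that the rate is derived from the CPU time column:

```python
n_elements = 16_777_216      # problem size from the benchmark name
bytes_per_element = 4        # assumption: float32 input
cpu_time_ns = 5_638_444      # Reduce1D/Torch CPU time from the table

# Bytes per nanosecond is numerically the same as GB/s (1e9 bytes per second).
gb_per_s = n_elements * bytes_per_element / cpu_time_ns
print(f"{gb_per_s:.3f} GB/s")
```

Under those assumptions the result agrees with the `11.902G/s` reported for `Reduce1D/Torch`.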
Result summary:
* The single-threaded performance of NNC TeRfactorV2 matches or exceeds that of ATen and its AVX2 naive counterpart.
Follow-up items:
* rfactor does not work well with split
* We don't have a multi-threaded implementation yet.
* Missing "parallel" scheduling primitive, which is the same one we need for pointwise ops.
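As a rough illustration of what an rfactor-style schedule buys for a full reduction: instead of one accumulator with a serial dependency chain, keep several independent partial sums and combine them at the end. The sketch below is plain Python with hypothetical function names, not the TE or native benchmark code:

```python
def reduce_naive(xs):
    """Single accumulator: every add depends on the previous one."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def reduce_rfactor(xs, lanes=8):
    """Keep `lanes` independent partial sums, then combine them at the end."""
    partial = [0.0] * lanes
    main = len(xs) - len(xs) % lanes
    for base in range(0, main, lanes):
        for lane in range(lanes):      # independent lanes map onto a SIMD vector
            partial[lane] += xs[base + lane]
    total = sum(partial)
    for x in xs[main:]:                # scalar tail for leftover elements
        total += x
    return total


data = [float(i % 7) for i in range(1000)]
print(reduce_naive(data), reduce_rfactor(data))
```

With general floating-point data the two orders of summation can differ in the last bits; the sample data here is small integers, so both results are exact.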
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25821880
Pulled By: zheng-xq
fbshipit-source-id: 8df3f40d1eed8749c8edcaacae5f0544dbf6bed3
2021-01-21 10:00:50 -08:00
Xiaoqiang Zheng
88b36230f5
Add full reduction benchmark. ( #50057 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50057
As part of the effort to calibrate TE reduction performance, adding a full reduction benchmark.
Also add a "skip_input_transformation" option.
Fixed other reduction benchmarks to accept the specific benchmarks that were listed.
Test plans:
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s1
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce_full_fwd_cpu_16777216_s0
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_inner_fwd_cpu_640_524288
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer
* python -m benchmarks.tensorexpr --device=cpu --mode=fwd reduce2d_outer_fwd_cpu_640_524288
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D25774138
Pulled By: zheng-xq
fbshipit-source-id: fd4598e5c29991be476e42235a059e8021d4f083
2021-01-21 09:56:46 -08:00
Marat Subkhankulov
dea9af5c06
Cat benchmark: use mobile feed tensor shapes and torch.cat out-variant ( #50778 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50778
- use tensor shapes from ctr_mobilefeed merge net
- use the pt cat out-variant for a fairer comparison; otherwise the benchmark includes the time to construct the result tensor
Test Plan:
turbo off, devbig machine
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime
# Benchmarking Caffe2: concat
# Name: concat_sizes(1,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (1, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.619
# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,160),(1,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 160), (1, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.369
# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.590
# Benchmarking Caffe2: concat
# Name: concat_sizes[(1,580),(1,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(1, 580), (1, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 0.412
# Benchmarking Caffe2: concat
# Name: concat_sizes(20,40)_N5_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: (20, 40), N: 5, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 2.464
# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,160),(20,14)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 160), (20, 14)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 1.652
# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 9.312
# Benchmarking Caffe2: concat
# Name: concat_sizes[(20,580),(20,174)]_N-1_axis1_add_axis0_devicecpu_dtypefloat
# Input: sizes: [(20, 580), (20, 174)], N: -1, axis: 1, add_axis: 0, device: cpu, dtype: float
Forward Execution Time (us) : 6.532
```
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=static_runtime
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : static_runtime
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,160),(1,14)]_N-1_dim1_cpu
# Input: sizes: [(1, 160), (1, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.313
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,20,40),(1,4,40),(1,5,40)]_N-1_dim1_cpu
# Input: sizes: [(1, 20, 40), (1, 4, 40), (1, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.680
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(1,580),(1,174)]_N-1_dim1_cpu
# Input: sizes: [(1, 580), (1, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 3.452
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,160),(20,14)]_N-1_dim1_cpu
# Input: sizes: [(20, 160), (20, 14)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 4.653
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,20,40),(20,4,40),(20,5,40)]_N-1_dim1_cpu
# Input: sizes: [(20, 20, 40), (20, 4, 40), (20, 5, 40)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.364
# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[(20,580),(20,174)]_N-1_dim1_cpu
# Input: sizes: [(20, 580), (20, 174)], N: -1, dim: 1, device: cpu
Forward Execution Time (us) : 7.055
```
Reviewed By: hlu1
Differential Revision: D25839036
fbshipit-source-id: 7a6a234f41dfcc56246a80141fe0c84f769a5a85
2021-01-19 22:50:28 -08:00
Nikita Shulga
171f265d80
Back out "Revert D25717510: Clean up some type annotations in benchmarks/fastrnns" ( #50556 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50556
Original commit changeset: 2bcc19cd4340
Test Plan: Soft revert hammer
Reviewed By: walterddr, seemethere
Differential Revision: D25917129
fbshipit-source-id: e5caad77655789d607b84eee820aa7c960e00f51
2021-01-14 15:15:03 -08:00
Bert Maher
468c99fba4
Reapply D25856891: [te] Benchmark comparing fused overhead to unfused ( #50543 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50543
Original commit changeset: 2d2f07f79986
Was part of a stack that got reverted. This is just a benchmark.
ghstack-source-id: 119825594
Test Plan: CI
Reviewed By: navahgar
Differential Revision: D25912439
fbshipit-source-id: 5d9ca45810fff8931a3cfbd03965e11050180676
2021-01-14 14:17:45 -08:00
Mike Ruberry
2639f1d4a6
Revert D25717510: Clean up some type annotations in benchmarks/fastrnns
...
Test Plan: revert-hammer
Differential Revision:
D25717510 (7d0eecc666 )
Original commit changeset: 4f6431d140e3
fbshipit-source-id: 2bcc19cd434047f3857e0d7e804d34f72e566c30
2021-01-14 07:23:45 -08:00
Mike Ruberry
4ee631cdf0
Revert D25856891: [te] Benchmark comparing fused overhead to unfused
...
Test Plan: revert-hammer
Differential Revision:
D25856891 (36ae3feb22 )
Original commit changeset: 0e99515ec2e7
fbshipit-source-id: 2d2f07f79986ca7815b9eae63e734db76bdfc0c8
2021-01-14 04:33:35 -08:00
Nikita Shulga
a3f9cf9497
Fix fastrnn benchmark regression introduced by 49946 ( #50517 )
...
Summary:
Simply add missing `from typing import List, Tuple` and `from torch import Tensor`
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/49946
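A small sketch of why the missing imports only became a problem after the annotation upgrade: type comments are never looked up, while Python 3 annotations are real expressions whose names must resolve. `Tensor` below is a stand-in class so the sketch runs without PyTorch, and `annotations_resolve` is a hypothetical helper:

```python
# Python 2 style: the types live in a comment, so no names are looked up.
py2_style = '''
def cell(x, hidden):
    # type: (Tensor, Tuple[Tensor, Tensor]) -> List[Tensor]
    return [x]
'''

# Python 3 style: the annotations are expressions that must be evaluated.
py3_style = '''
def cell(x: Tensor, hidden: Tuple[Tensor, Tensor]) -> List[Tensor]:
    return [x]
'''


def annotations_resolve(source, preamble=""):
    """Return True if defining `cell` and reading its annotations works."""
    namespace = {}
    try:
        exec(preamble + source, namespace)
        namespace["cell"].__annotations__  # forces evaluation if it is deferred
        return True
    except NameError:
        return False


# Stand-in for `from torch import Tensor`.
imports = "from typing import List, Tuple\nclass Tensor: pass\n"

results = (
    annotations_resolve(py2_style),
    annotations_resolve(py3_style),
    annotations_resolve(py3_style, imports),
)
print(results)
```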
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50517
Reviewed By: gchanan
Differential Revision: D25908379
Pulled By: malfet
fbshipit-source-id: a44b96681b6121e61b69f960f81c0cad3f2a8d20
2021-01-13 19:10:11 -08:00
Bert Maher
36ae3feb22
[te] Benchmark comparing fused overhead to unfused ( #50305 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50305
That's it
ghstack-source-id: 119631533
Test Plan:
```
buck run //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -- --benchmark_filter=Overhead
```
```
Run on (24 X 2394.67 MHz CPU s)
2021-01-08 16:06:17
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
FusedOverhead 2157 ns 2157 ns 311314
UnfusedOverhead 2443 ns 2443 ns 311221
```
Reviewed By: ZolotukhinM
Differential Revision: D25856891
fbshipit-source-id: 0e99515ec2e769a04929157d46903759c03182a3
2021-01-13 12:09:37 -08:00
Richard Barnes
7d0eecc666
Clean up some type annotations in benchmarks/fastrnns ( #49946 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49946
Upgrades type annotations from Python2 to Python3
Test Plan: Sandcastle tests
Reviewed By: xush6528
Differential Revision: D25717510
fbshipit-source-id: 4f6431d140e3032b4ca55587f9602aa0ea38c671
2021-01-13 09:57:14 -08:00
Marat Subkhankulov
49896c48e0
Caffe2 Concat operator benchmark ( #50449 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50449
Port caffe2 operator benchmark from torch.cat to caffe2 concat to measure the difference in performance.
The previous diff (D25738076) was abandoned in order to rerun the GitHub CI tests.
Test Plan:
Tested on devbig by running both pt and c2 benchmarks. Compiled with mode/opt
Inputs:
```
size, number of inputs, cat dimension, device
----------------------------------------------------
(1, 1, 1), N: 2, dim: 0, device: cpu
(512, 512, 2), N: 2, dim: 1, device: cpu
(128, 1024, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 0, device: cpu
(1025, 1023, 2), N: 2, dim: 1, device: cpu
(1024, 1024, 2), N: 2, dim: 2, device: cpu
[<function <lambda> at 0x7f922718e8c0>, 111, 65], N: 5, dim: 0, device: cpu
[96, <function <lambda> at 0x7f9226dad710>, 64], N: 5, dim: 1, device: cpu
[128, 64, <function <lambda> at 0x7f91a3625ef0>], N: 5, dim: 2, device: cpu
[<function <lambda> at 0x7f91a3625f80>, 32, 64], N: 50, dim: 0, device: cpu
[32, <function <lambda> at 0x7f91a3621050>, 64], N: 50, dim: 1, device: cpu
[33, 65, <function <lambda> at 0x7f91a36210e0>], N: 50, dim: 2, device: cpu
(64, 32, 4, 16, 32), N: 2, dim: 2, device: cpu
(16, 32, 4, 16, 32), N: 8, dim: 2, device: cpu
(9, 31, 5, 15, 33), N: 17, dim: 4, device: cpu
[<function <lambda> at 0x7f91a3621170>], N: 100, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621200>], N: 1000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621290>], N: 2000, dim: 0, device: cpu
[<function <lambda> at 0x7f91a3621320>], N: 3000, dim: 0, device: cpu
```
```
pytorch: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter=all
caffe2: MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 buck-out/gen/caffe2/benchmarks/operator_benchmark/c2/concat_test.par --tag_filter=all
```
```
Metric: Forward Execution Time (us)
pytorch | caffe2
--------------------------------
4.066 | 0.312
351.507 | 584.033
184.649 | 292.157
9482.895 | 6845.112
9558.988 | 6847.511
13730.016 | 14118.505
6324.371 | 4840.883
4613.497 | 3702.213
7504.718 | 7889.751
9882.978 | 7364.350
10087.076 | 7483.178
16849.556 | 18092.295
19181.075 | 13363.742
19296.508 | 13466.863
34157.449 | 56320.073
176.483 | 267.106
322.247 | 352.782
480.064 | 460.214
607.381 | 476.908
```
Reviewed By: hlu1
Differential Revision: D25890595
fbshipit-source-id: f53e125c0680bc2ebf722d1da5ec964bec585fdd
2021-01-12 18:27:44 -08:00
Oscar Sandoval
09f4844c1f
Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) ( #46901 )
...
Summary:
A Pytorch Distributed RPC benchmark measuring Agent and Observer Throughput and Latency for Reinforcement Learning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46901
Reviewed By: mrshenli
Differential Revision: D25869514
Pulled By: osandoval-fb
fbshipit-source-id: c3b36b21541d227aafd506eaa8f4e5f10da77c78
2021-01-11 19:02:36 -08:00
Fritz Obermeyer
093aca082e
Enable distribution validation if __debug__ ( #48743 )
...
Summary:
Fixes https://github.com/pytorch/pytorch/issues/47123
Follows https://github.com/pyro-ppl/pyro/pull/2701
This turns on `Distribution` validation by default. The motivation is to favor beginners by providing helpful error messages. Advanced users focused on speed can disable validation by calling
```py
torch.distributions.Distribution.set_default_validate_args(False)
```
or by disabling individual distribution validation via `MyDistribution(..., validate_args=False)`.
In practice I have found many beginners forget or do not know about validation. Therefore I have [enabled it by default](https://github.com/pyro-ppl/pyro/pull/2701) in Pyro. I believe PyTorch could also benefit from this change. Indeed, validation caught a number of bugs in `.icdf()` methods, in tests, and in PPL benchmarks, all of which have been fixed in this PR.
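A toy model of the "on by default if `__debug__`" pattern, with a class-level default and a per-instance override. `ToyDistribution` and `ToyBernoulli` are hypothetical classes written for illustration; they are not `torch.distributions` code:

```python
class ToyDistribution:
    """Toy stand-in for a distribution base class."""

    _validate_args = __debug__  # True unless Python runs with -O

    @classmethod
    def set_default_validate_args(cls, value: bool) -> None:
        cls._validate_args = value

    def __init__(self, validate_args=None):
        if validate_args is not None:
            self._validate_args = validate_args  # per-instance override


class ToyBernoulli(ToyDistribution):
    def __init__(self, probs: float, validate_args=None):
        super().__init__(validate_args)
        if self._validate_args and not 0.0 <= probs <= 1.0:
            raise ValueError(f"probs must be in [0, 1], got {probs}")
        self.probs = probs


# With validation on, a bad parameter fails fast with a readable message.
ToyDistribution.set_default_validate_args(True)
try:
    ToyBernoulli(1.5)
    caught = False
except ValueError:
    caught = True

# Speed-focused callers can opt out per instance.
unchecked = ToyBernoulli(1.5, validate_args=False)
```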
## Release concerns
- This may slightly slow down some models. Concerned users may disable validation.
- This may cause new `ValueErrors` in models that rely on unsupported behavior, e.g. `Categorical.log_prob()` applied to continuous-valued tensors (only {0,1}-valued tensors are supported).
We should clearly note this change in release notes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48743
Reviewed By: heitorschueroff
Differential Revision: D25304247
Pulled By: neerajprad
fbshipit-source-id: 8d50f28441321ae691f848c55f71aa80cb356b41
2021-01-05 13:59:10 -08:00
Samuel Marks
e6779d4357
[*.py] Rename "Arguments:" to "Args:" ( #49736 )
...
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.
```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args: 1095
Arguments: 0336
```
It is easy enough to extend my parsers to support both variants, however it looks like `Arguments:` is wrong anyway, as per:
- https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)
- https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)
- https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)
Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.
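One way such a rename could be scripted, shown as a sketch rather than the method actually used for this PR. It only rewrites lines that consist of the `Arguments:` header, so prose mentioning the word is left alone; `rename_arguments_header` is a hypothetical helper:

```python
import re

# Match a docstring section header that is exactly "Arguments:" on its own line.
_HEADER = re.compile(r"^([ \t]*)Arguments:[ \t]*$", flags=re.MULTILINE)


def rename_arguments_header(text: str) -> str:
    """Replace `Arguments:` section headers with `Args:`, keeping indentation."""
    return _HEADER.sub(r"\1Args:", text)


sample = '''def f(x, y):
    """Add two numbers.

    Arguments:
        x: first operand
        y: second operand
    """
    return x + y
'''

fixed = rename_arguments_header(sample)
print(fixed)
```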
PS: For related PRs, see tensorflow/tensorflow/pull/45420
PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736
Reviewed By: albanD
Differential Revision: D25710534
Pulled By: soumith
fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
skyline75489
46b83212d1
Remove unused six code for Python 2/3 compatibility ( #48077 )
...
Summary:
This is basically a reborn version of https://github.com/pytorch/pytorch/issues/45254 .
Ref: https://github.com/pytorch/pytorch/issues/42919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48077
Reviewed By: ngimel
Differential Revision: D25687042
Pulled By: bugra
fbshipit-source-id: 05f20a6f3c5212f73d0b1505b493b720e6cf74e5
2020-12-22 18:07:08 -08:00
Alexander
44ce0b8883
Sparse-sparse matrix multiplication (CPU/CUDA) ( #39526 )
...
Summary:
This PR implements matrix multiplication support for 2-d sparse tensors using the COO sparse format.
The current implementation of `torch.sparse.mm` supports this configuration,
`torch.sparse.mm(sparse_matrix1, sparse_matrix2.to_dense())`, but this can use a lot of memory when sparse_matrix2 is large.
This implementation extends `torch.sparse.mm` to support `torch.sparse.mm(sparse_matrix1, sparse_matrix2)`.
Resolves [#20988](https://github.com/pytorch/pytorch/issues/20988) for CPU/CUDA.
- [x] sparse matmul
- [x] CPU/CUDA C++ implementation
- [x] unittests
- [x] update torch.sparse.mm documentation
- [x] autograd support
The CPU sparse-sparse matmul was implemented using "Sparse Matrix Multiplication Package (SMMP)" as a reference. The GPU sparse-sparse matmul is based on cuSPARSE, with separate code paths for CUSPARSE_VERSION >= 11 and for older versions of cuSPARSE. Both the CPU and CUDA paths run the sparse-sparse matmul on the CSR index format, as it is one of the fastest algorithms.
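To make the COO-to-CSR-then-multiply flow concrete, here is a pure-Python sketch of a row-wise sparse-sparse product. It is a simplified illustration with hypothetical function names, not the C++/CUDA code in this PR, and it assumes the output has as many rows as the left operand:

```python
from collections import defaultdict


def coo_to_csr(rows, cols, vals, n_rows):
    """Convert COO triplets to CSR arrays (indptr, indices, data)."""
    order = sorted(range(len(vals)), key=lambda k: (rows[k], cols[k]))
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1              # count entries per row
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]      # prefix sum gives row offsets
    return indptr, [cols[k] for k in order], [vals[k] for k in order]


def csr_matmul(a, b, n_rows):
    """Row-wise product: C[i, :] = sum over k of A[i, k] * B[k, :]."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    out = []  # COO triplets (row, col, value) of the result
    for i in range(n_rows):
        acc = defaultdict(float)        # sparse accumulator for row i
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_idx[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                acc[b_idx[q]] += a_val[p] * b_val[q]
        out.extend((i, j, v) for j, v in sorted(acc.items()))
    return out


# A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
A = coo_to_csr([0, 1], [0, 1], [1.0, 2.0], n_rows=2)
B = coo_to_csr([0, 1], [1, 0], [3.0, 4.0], n_rows=2)
C = csr_matmul(A, B, n_rows=2)
print(C)
```

Only nonzero entries of both operands are ever touched, which is why this avoids the memory cost of densifying the right-hand matrix.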
Here it is the latest benchmark (script is here) results for torch.sparse.mm (CUDA) and torch.sparse.mm (CPU) and scipy, values are float32 scalars:
size | density | sparse.mm(CUDA) | sparse.mm(CPU) | scipy_coo_matmul
-- | -- | -- | -- | --
(32, 10000) | 0.01 | 822.7 | 79.4 | 704.1
(32, 10000) | 0.05 | 1741.1 | 402.6 | 1155.3
(32, 10000) | 0.1 | 2956.8 | 840.8 | 1885.4
(32, 10000) | 0.25 | 6417.7 | 2832.3 | 4665.2
(512, 10000) | 0.01 | 1010.2 | 3941.3 | 26937.7
(512, 10000) | 0.05 | 2216.2 | 26903.8 | 57343.7
(512, 10000) | 0.1 | 4868.4 | 87773.7 | 117477.0
(512, 10000) | 0.25 | 16639.3 | 608105.0 | 624290.4
(1024, 10000) | 0.01 | 1224.8 | 13088.1 | 110379.2
(1024, 10000) | 0.05 | 3897.5 | 94783.9 | 236541.8
(1024, 10000) | 0.1 | 10559.1 | 405312.5 | 525483.4
(1024, 10000) | 0.25 | 57456.3 | 2424337.5 | 2729318.7
A new backward algorithm was implemented using only `sparse @ sparse` and `sparse_mask` operations. Here is some benchmarking:
```
[------------------------- sparse.mm-backward -------------------------]
| sparse.backward | dense.backward
-----------------------------------------------------------------------
(32, 10000) | 0.01 | 13.5 | 2.4
(32, 10000) | 0.05 | 52.3 | 2.4
(512, 10000) | 0.01 | 1016.8 | 491.5
(512, 10000) | 0.05 | 1604.3 | 492.3
(1024, 10000) | 0.01 | 2384.1 | 1963.7
(1024, 10000) | 0.05 | 3965.8 | 1951.9
```
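One plausible reading of "using only `sparse @ sparse` and `sparse_mask`" is the usual matmul gradient rule, with each gradient restricted to the sparsity pattern of its input. The sketch below uses dict-of-keys matrices in plain Python; every helper name here is hypothetical and this is not the PR's implementation.

```python
def matmul(x, y):
    """Sparse product of dict-of-keys matrices {(row, col): value}."""
    out = {}
    for (i, k), xv in x.items():
        for (k2, j), yv in y.items():
            if k == k2:
                out[(i, j)] = out.get((i, j), 0.0) + xv * yv
    return out


def transpose(x):
    return {(j, i): v for (i, j), v in x.items()}


def sparse_mask(x, pattern):
    """Keep only the entries of x at positions that `pattern` stores."""
    return {pos: x.get(pos, 0.0) for pos in pattern}


def matmul_backward(grad_c, a, b):
    # For c = a @ b: grad_a = grad_c @ b^T and grad_b = a^T @ grad_c,
    # each masked so the gradient stays as sparse as its input.
    grad_a = sparse_mask(matmul(grad_c, transpose(b)), a)
    grad_b = sparse_mask(matmul(transpose(a), grad_c), b)
    return grad_a, grad_b


a = {(0, 0): 2.0}                            # 2x1, only one stored entry
b = {(0, 0): 3.0, (0, 1): 5.0}               # 1x2
grad_c = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0}
grad_a, grad_b = matmul_backward(grad_c, a, b)
```

The masking step is what keeps memory bounded: the unmasked product `grad_c @ b^T` has an entry at `(1, 0)`, but it is dropped because `a` stores nothing there.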
I added new benchmark tests. Now I am using a real dataset used in recent studies [1, 2] with different sparsity levels.
```
[---------------------------------- matmul ---------------------------------]
| 0.5 | 0.7 | 0.8 | 0.9 | 0.95 | 0.98
1 threads: ------------------------------------------------------------------
(cpu) torch | 5.4 | 5.4 | 5.2 | 5.3 | 5.3 | 5.4
torch.sparse | 122.2 | 51.9 | 27.5 | 11.4 | 4.9 | 1.8
scipy | 150.1 | 87.4 | 69.2 | 56.8 | 38.4 | 17.1
(cuda) torch | 1.3 | 1.1 | 1.1 | 1.1 | 1.1 | 1.1
torch.sparse | 20.0 | 8.4 | 5.1 | 2.5 | 1.5 | 1.1
[----------------------------------- backward -----------------------------------]
| 0.5 | 0.7 | 0.8 | 0.9 | 0.95 | 0.98
1 threads: -----------------------------------------------------------------------
(cpu) torch | 17.7 | 17.9 | 17.7 | 17.7 | 17.6 | 17.9
torch.sparse | 672.9 | 432.6 | 327.5 | 230.8 | 176.7 | 116.7
(cuda) torch | 3.8 | 3.6 | 3.5 | 3.5 | 3.6 | 3.5
torch.sparse | 68.8 | 46.2 | 35.6 | 24.2 | 17.8 | 11.9
Times are in milliseconds (ms).
```
In summary, the new `sparse @ sparse` backward algorithm is the better choice because it prioritizes saving memory over raw speed. It is also better than the other options tested before.
## **References**
1. Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen. **Sparse GPU Kernels for Deep Learning.** Proceedings of the International Conference for High Performance Computing, 2020. [https://github.com/google-research/google-research/tree/master/sgk ](https://github.com/google-research/google-research/tree/master/sgk )
2. Trevor Gale, Erich Elsen, Sara Hooker. **The State of Sparsity in Deep Neural Networks.** [https://github.com/google-research/google-research/tree/master/state_of_sparsity ](https://github.com/google-research/google-research/tree/master/state_of_sparsity )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39526
Reviewed By: mruberry
Differential Revision: D25661239
Pulled By: ngimel
fbshipit-source-id: b515ecd66d25f347d637e159d51aa45fb43b6938
2020-12-21 11:53:55 -08:00
mrshenli
e4eaa6de5f
Fix lint ( #49629 )
...
Summary:
Fix lint on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49629
Reviewed By: rohan-varma
Differential Revision: D25654199
Pulled By: mrshenli
fbshipit-source-id: 2ab5669ad47996c0ca0f9b6611855767d5af0506
2020-12-18 19:26:06 -08:00
Pritam Damania
159de1f1d6
Add benchmark for torch.distributed.pipeline.sync.Pipe ( #49577 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49577
Repurposing the benchmarking from
https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py
and pulling in a stripped down version of the benchmark into PyTorch.
Sample output:
```
Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4)
Number of parameters for model: 292833040
| batch 1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37
| batch 2 | wps 4405.16 | loss 19.36 | ppl 256201548.33
| batch 3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37
| batch 4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83
| batch 5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33
| batch 6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50
| batch 7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68
| batch 8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82
| batch 9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98
| batch 10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94
Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB,
```
ghstack-source-id: 118939686
Test Plan: sentinel
Reviewed By: rohan-varma
Differential Revision: D25628721
fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a
2020-12-18 18:33:47 -08:00
Shijun Kong
2de345d44d
Add op bench for caffe2 quantile op ( #49598 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49598
Add op bench for caffe2 quantile op
Test Plan: `buck run mode/opt caffe2/benchmarks/operator_benchmark/c2:quantile_op_test -- --warmup_iterations=10000 --iterations=10000`
Reviewed By: radkris-git
Differential Revision: D25590085
fbshipit-source-id: 0db58ac87c595b2bf2958f6299a1bf2ccea019db
2020-12-18 08:32:59 -08:00
Bram Wasti
1047957831
[te][reapply] Add fast log approximation based on sleef ( #49575 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49575
This is a fast log implementation.
benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
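For readers unfamiliar with this style of approximation, here is a minimal sketch of a range-reduced log in plain Python. It assumes the common approach of splitting off the exponent and evaluating a short odd polynomial on the mantissa; the coefficients are plain series terms, not the tuned constants of sleef or of this diff, and `fast_log` is a hypothetical name.

```python
import math

LN2 = math.log(2.0)
SQRT_HALF = math.sqrt(0.5)


def fast_log(x):
    """Approximate natural log for x > 0 via exponent/mantissa splitting."""
    if x <= 0.0:
        raise ValueError("fast_log expects a positive input")
    m, e = math.frexp(x)      # x == m * 2**e with 0.5 <= m < 1
    if m < SQRT_HALF:         # recentre the mantissa around 1
        m *= 2.0
        e -= 1
    t = (m - 1.0) / (m + 1.0)  # log(m) == 2 * atanh(t), with |t| < 0.172
    t2 = t * t
    # Truncated series for atanh(t) / t, evaluated in Horner form.
    poly = 1.0 + t2 * (1.0 / 3 + t2 * (1.0 / 5 + t2 * (1.0 / 7 + t2 / 9)))
    return e * LN2 + 2.0 * t * poly
```

Because `t` stays small after the reduction, a handful of terms is enough, and the body is branch-light arithmetic that vectorizes well.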
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25627157
fbshipit-source-id: a4920f4f4005ce617d372b375e790ca966275cd9
2020-12-17 17:02:00 -08:00
Edward Yang
ea4ccc730e
Revert D25445815: [te] Add fast log approximation based on sleef
...
Test Plan: revert-hammer
Differential Revision:
D25445815 (1329066b69 )
Original commit changeset: 20696eacd12a
fbshipit-source-id: 38830a6abd16260d60e5dd9a5594e65736a9c782
2020-12-17 15:03:17 -08:00
Bram Wasti
1329066b69
[te] Add fast log approximation based on sleef
...
Summary:
This is a fast log implementation.
benchmark:
```
buck run mode/opt //caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench -c 'fbcode.caffe2_gpu_type=none'
```
Test Plan: buck test mode/no-gpu //caffe2/test/cpp/tensorexpr:tensorexpr -- *.fastLogFloat
Reviewed By: bertmaher
Differential Revision: D25445815
fbshipit-source-id: 20696eacd12a55e797f606f4a6dbbd94c9652888
2020-12-17 14:28:34 -08:00
Ansha Yu
cb3169d7a8
[aten] index_select dim 1 ( #47077 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47077
Add benchmarks for pt index_select, batch_index_select, and c2's BatchGather
Add batch_index_select implementation based on the C2 BatchGather implementation
This currently falls back to index_select for backwards and cuda implementations.
Alternatively, we can look into the specifics of why index_select is slower and
replace the original implementation instead.
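To make the access pattern concrete, here is a small sketch of selecting along dim 1 of an `[M][N][K]` nested list. It only illustrates the semantics being benchmarked, not the ATen or C2 kernels; the function name `index_select_dim1` is an assumption.

```python
def index_select_dim1(data, index):
    """Gather entries along dim 1 of a nested list shaped [M][N][K]."""
    # For every outer row, pick the requested positions; each pick is a
    # whole block of K trailing values.
    return [[row[j] for j in index] for row in data]


data = [
    [[0, 1], [2, 3], [4, 5]],
    [[6, 7], [8, 9], [10, 11]],
]  # M=2, N=3, K=2
out = index_select_dim1(data, [2, 0])
```

Each selected element is a contiguous block of K values, so a batched implementation can copy one block per index for every outer row.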
Test Plan:
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/c2/batch_gather_test.par
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/index_select_test.par
PT results comparing without fix, block_size 1 only, and all dim=1
```
# no optimization
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 353.450
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 862.492
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4555.344
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 11003.279
```
```
# block size 1 only
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 129.240
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 266.776
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4508.593
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 10391.655
```
```
# dim 1
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K1_dim1_cpu
# Input: M: 8, N: 8, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 3.736
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K1_dim1_cpu
# Input: M: 256, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 130.460
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K1_dim1_cpu
# Input: M: 512, N: 512, K: 1, dim: 1, device: cpu
Forward Execution Time (us) : 267.706
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M8_N8_K2_dim1_cpu
# Input: M: 8, N: 8, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 4.187
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M256_N512_K2_dim1_cpu
# Input: M: 256, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 1739.550
# Benchmarking PyTorch: index_select
# Mode: Eager
# Name: index_select_M512_N512_K2_dim1_cpu
# Input: M: 512, N: 512, K: 2, dim: 1, device: cpu
Forward Execution Time (us) : 3468.332
```
C2 results:
```
# Benchmarking Caffe2: batch_gather
WARNING: Logging before InitGoogleLogging() is written to STDERR
W1203 13:19:35.310904 782584 init.h:137] Caffe2 GlobalInit should be run before any other API calls.
# Name: batch_gather_M8_N8_K1_devicecpu
# Input: M: 8, N: 8, K: 1, device: cpu
Forward Execution Time (us) : 0.308
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K1_devicecpu
# Input: M: 256, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 90.517
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K1_devicecpu
# Input: M: 512, N: 512, K: 1, device: cpu
Forward Execution Time (us) : 200.009
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M8_N8_K2_devicecpu
# Input: M: 8, N: 8, K: 2, device: cpu
Forward Execution Time (us) : 0.539
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M256_N512_K2_devicecpu
# Input: M: 256, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 1001.540
# Benchmarking Caffe2: batch_gather
# Name: batch_gather_M512_N512_K2_devicecpu
# Input: M: 512, N: 512, K: 2, device: cpu
Forward Execution Time (us) : 2005.870
```
buck test dper3/dper3/modules/low_level_modules/tests:single_operators_test -- test_batch_gather
Reviewed By: hlu1
Differential Revision: D24630227
fbshipit-source-id: cd205a30d96a33d239f3266820ada9a90093cf91
2020-12-14 15:39:33 -08:00
Bram Wasti
f4226b5c90
[static runtime] add static subgraph fusion pass ( #49185 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49185
This diff adds a fusion feature that will let us use static runtime for *parts* of the graph. This will prove useful in cases where, for example, fully eliminating control flow is hard.
TODO:
[x] factor out into separate fusion file
[x] add python test case
[x] add graph that isn't fully lowered test case
[x] add graph that has weird list/tuple outputs test case
the loop example looks quite good:
```
graph(%a.1 : Tensor,
%b.1 : Tensor,
%iters.1 : int):
%12 : bool = prim::Constant[value=1]() # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
%c.2 : Tensor = prim::StaticSubgraph_0(%a.1, %b.1)
%c : Tensor = prim::Loop(%iters.1, %12, %c.2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:110:4
block0(%i : int, %c.12 : Tensor):
%c.10 : Tensor = prim::StaticSubgraph_1(%a.1, %c.12, %b.1)
-> (%12, %c.10)
return (%c)
with prim::StaticSubgraph_0 = graph(%0 : Tensor,
%4 : Tensor):
%5 : int = prim::Constant[value=2]()
%6 : Tensor = aten::mul(%4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:12
%2 : int = prim::Constant[value=1]()
%c.2 : Tensor = aten::add(%0, %6, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:109:8
return (%c.2)
with prim::StaticSubgraph_1 = graph(%1 : Tensor,
%7 : Tensor,
%8 : Tensor):
%9 : int = prim::Constant[value=1]()
%c.4 : Tensor = aten::add(%7, %8, %9) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:111:12
%5 : int = prim::Constant[value=2]()
%c.7 : Tensor = aten::mul_(%c.4, %5) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:112:8
%2 : int = prim::Constant[value=1]()
%c.10 : Tensor = aten::sub_(%c.7, %1, %2) # /data/users/bwasti/fbsource/fbcode/buck-out/dev/gen/caffe2/test/static_runtime#binary,link-tree/test_static_runtime.py:113:8
return (%c.10)
```
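Read back from the IR above, the loop example seems to correspond to roughly the following Python. This is a reconstruction, with plain floats standing in for tensors, and the name `loop_graph` is an assumption.

```python
def loop_graph(a, b, iters):
    # prim::StaticSubgraph_0: the straight-line prologue before the loop.
    c = a + b * 2
    for _ in range(iters):
        # prim::StaticSubgraph_1: the loop body, fused as one static subgraph.
        c = c + b
        c *= 2
        c -= a
    return c


result = loop_graph(1.0, 2.0, 2)
```

The `prim::Loop` node itself stays in the interpreter; only the two straight-line regions around and inside it are handed to static runtime.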
(Note: this ignores all push blocking failures!)
Test Plan:
buck test mode/no-gpu //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/no-gpu caffe2/test:static_runtime
Reviewed By: bertmaher
Differential Revision: D25385702
fbshipit-source-id: 2f24af4f11d92a959167facd03fbd24f464a6098
2020-12-10 14:03:11 -08:00
Edward Yang
16b8e6ab01
Class-based structured kernels, with migration of add to framework ( #48718 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48718
This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check https://github.com/pytorch/rfcs/pull/9 for a mostly up-to-date high level description of what's going on here.
High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
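The class-based split between meta and impl can be pictured with a loose Python analogy. This is only a sketch of the pattern, not the generated C++; all class and function names below other than `set_output` and `maybe_get_output` are hypothetical.

```python
class MetaBase:
    """Stand-in base class: subclasses decide what set_output does."""

    def set_output(self, shape):
        raise NotImplementedError

    def maybe_get_output(self):
        return None  # the output may not exist while running meta


class AddMeta(MetaBase):
    def meta(self, a, b):
        # Shape checking only; no data is touched here.
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        self.set_output((len(a),))


class AddShapeOnly(AddMeta):
    """Records the output shape without allocating, like a meta tensor."""

    def set_output(self, shape):
        self.shape = shape


class AddCPU(AddMeta):
    """Allocates in set_output, then fills the buffer in impl."""

    def __init__(self):
        self.out = None

    def set_output(self, shape):
        self.out = [0.0] * shape[0]

    def maybe_get_output(self):
        return self.out

    def impl(self, a, b):
        for i, (x, y) in enumerate(zip(a, b)):
            self.out[i] = x + y


def add(a, b):
    op = AddCPU()
    op.meta(a, b)   # validates and triggers allocation via set_output
    op.impl(a, b)   # computes into the allocated output
    return op.out


total = add([1.0, 2.0], [3.0, 4.0])
```

The point of the design is that `meta` is written once, while each backend chooses what allocation means by overriding `set_output`.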
TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D25278031
Pulled By: ezyang
fbshipit-source-id: 57c43a6e5df21929b68964d485995fbbae4d1f7b
2020-12-09 15:39:12 -08:00
Brian Hirsh
c7cc8a48c0
migrating some straggler pytorch ops in fbcode to the new registration API ( #48954 )
...
Summary:
I already migrated the majority of fbcode ops to the new registration API, but there are a few stragglers (mostly new files that were created in the last two weeks).
The goal is mostly to stamp out as much of the legacy registration API usage as possible, so that people only see the new API when they look around the code for examples of how to register their own ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48954
ghstack-source-id: 118140663
Test Plan: Ran buck targets for each file that I migrated
Reviewed By: ezyang
Differential Revision: D25380422
fbshipit-source-id: 268139a1d7b9ef14c07befdf9e5a31f15b96a48c
2020-12-09 14:42:29 -08:00
Nikolay Korovaiko
195ab5e864
remove non-default settings in fuser.py ( #48862 )
...
Summary:
I've noticed we are setting `_jit_set_num_profiled_runs` to 2 (which isn't our default) and sometimes we don't. We are also setting `_jit_set_bailout_depth` to 20 which **is** our default. I suggest we remove this logic altogether.
I did a quick run to see if there's any impact and thankfully, the numbers seem to be consistent, but we should try to avoid testing configurations that aren't the default or aren't expected to become the default.
numactl -C 3 python -m fastrnns.bench --fuser=te --executor=profiling
non-defaults:
```
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor='profiling', fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
cudnn 5.057 0.06287 None 7.322 0.07404 None
aten 5.602 0.06303 None 13.64 0.4078 None
jit 7.019 0.07995 None 13.77 0.554 None
jit_premul 5.324 0.06203 None 12.01 0.2996 None
jit_premul_bias 5.148 0.08061 None 11.62 0.4104 None
jit_simple 6.69 0.2317 None 13.37 0.3791 None
jit_multilayer 7.006 0.251 None 13.67 0.2239 None
py 19.05 0.1119 None 28.28 0.6346 None
Benchmarking ResNets...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
resnet18 8.712 0.01628 None 19.93 0.03512 None
resnet18_jit 8.688 0.01374 None 19.79 0.07518 None
resnet50 31.04 0.08049 None 66.44 0.08187 None
resnet50_jit 31.11 0.07171 None 66.45 0.09157 None
```
defaults:
```
Namespace(cnns=None, cuda_pointwise_block_count=None, cuda_pointwise_block_size=None, cuda_pointwise_loop_level=None, device='cuda', executor='profiling', fuser='te', group=['cnns', 'rnns'], hiddenSize=512, inputSize=512, miniBatch=64, nloops=100, numLayers=1, print_json=None, rnns=None, sep=' ', seqLength=100, variable_lstms=False, warmup=10)
Benchmarking LSTMs...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
cudnn 5.086 0.115 None 7.394 0.1743 None
aten 5.611 0.2559 None 13.54 0.387 None
jit 7.062 0.3358 None 13.24 0.3688 None
jit_premul 5.379 0.2086 None 11.57 0.3987 None
jit_premul_bias 5.202 0.2127 None 11.13 0.06748 None
jit_simple 6.648 0.05794 None 12.84 0.3047 None
jit_multilayer 6.964 0.1104 None 13.24 0.3283 None
py 19.14 0.09959 None 28.17 0.4946 None
Benchmarking ResNets...
name avg_fwd std_fwd info_fwd avg_bwd std_bwd info_bwd
resnet18 8.713 0.01563 None 19.93 0.02759 None
resnet18_jit 8.697 0.01792 None 19.78 0.06916 None
resnet50 31.14 0.07431 None 66.57 0.07418 None
resnet50_jit 31.21 0.0677 None 66.56 0.08655 None
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48862
Reviewed By: bertmaher
Differential Revision: D25342097
Pulled By: Krovatkin
fbshipit-source-id: 8d2f72c2770793ec8cecee9dfab9aaaf2e1ad2b1
2020-12-05 20:58:39 -08:00
elfringham
db1b0b06c4
Flake8 fixes ( #48453 )
...
Summary:
Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453
Reviewed By: mruberry
Differential Revision: D25181871
Pulled By: ngimel
fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641
2020-11-25 19:09:50 -08:00
Ilia Cherniavskii
f7a8bf2855
Use libkineto in profiler ( #46470 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46470
Adding ability to use Kineto (CUPTI) to profile CUDA kernels
Test Plan:
USE_KINETO=1 USE_CUDA=1 USE_MKLDNN=1 BLAS=MKL BUILD_BINARY=1 python setup.py develop install
python test/test_profiler.py
python test/test_autograd.py -k test_profile
python test/test_autograd.py -k test_record
```
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Memcpy HtoD (Pageable -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 33.33% 2.000us 1.000us 2
sgemm_32x32x32_NN 0.00% 0.000us 0.00% 0.000us 0.000us 2.000us 33.33% 2.000us 2.000us 1
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.000us 16.67% 1.000us 1.000us 1
Memcpy DtoH (Device -> Pageable) 0.00% 0.000us 0.00% 0.000us 0.000us 1.000us 16.67% 1.000us 1.000us 1
aten::randn 5.17% 74.000us 6.71% 96.000us 48.000us 0.000us 0.00% 0.000us 0.000us 2
aten::empty 1.33% 19.000us 1.33% 19.000us 4.750us 0.000us 0.00% 0.000us 0.000us 4
aten::normal_ 1.05% 15.000us 1.05% 15.000us 7.500us 0.000us 0.00% 0.000us 0.000us 2
aten::to 77.90% 1.114ms 91.61% 1.310ms 436.667us 0.000us 0.00% 3.000us 1.000us 3
aten::empty_strided 2.52% 36.000us 2.52% 36.000us 12.000us 0.000us 0.00% 0.000us 0.000us 3
aten::copy_ 2.73% 39.000us 11.19% 160.000us 53.333us 0.000us 0.00% 3.000us 1.000us 3
cudaMemcpyAsync 4.34% 62.000us 4.34% 62.000us 20.667us 0.000us 0.00% 0.000us 0.000us 3
cudaStreamSynchronize 1.61% 23.000us 1.61% 23.000us 7.667us 0.000us 0.00% 0.000us 0.000us 3
aten::mm 0.21% 3.000us 7.20% 103.000us 103.000us 0.000us 0.00% 2.000us 2.000us 1
aten::stride 0.21% 3.000us 0.21% 3.000us 1.000us 0.000us 0.00% 0.000us 0.000us 3
cudaLaunchKernel 2.45% 35.000us 2.45% 35.000us 17.500us 0.000us 0.00% 0.000us 0.000us 2
aten::add 0.49% 7.000us 4.27% 61.000us 61.000us 0.000us 0.00% 1.000us 1.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
```
benchmark: https://gist.github.com/ilia-cher/a5a9eb6b68504542a3cad5150fc39b1a
Reviewed By: Chillee
Differential Revision: D25142223
Pulled By: ilia-cher
fbshipit-source-id: b0dff46c28da5fb0a8e01cf548aa4f2b723fde80
2020-11-25 04:32:16 -08:00
Hao Lu
c5dae335e4
[PT][StaticRuntime] Move prim op impl to ops.cpp ( #48210 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48210
- Move prim op implementation from `ProcessedNode::run` to `getNativeOperation`
- Add out variant for `prim::listConstruct`
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
buck run mode/dev //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench -- \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge/container_precomputation_bs1.pt \
--iters=1 --warmup_iters=1 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true
```
Reviewed By: ajyu
Differential Revision: D24748947
fbshipit-source-id: 12caeeae87b69e60505a6cea31786bd96f5c8684
2020-11-18 23:07:39 -08:00
Bert Maher
464d23e6b4
[te][benchmark] Add more optimized versions of gemm ( #48159 )
...
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48159
Test Plan: Imported from OSS
Reviewed By: Chillee, ngimel
Differential Revision: D25059742
Pulled By: bertmaher
fbshipit-source-id: f197347f739c5bd2a4182c59ebf4642000c3dd55
2020-11-18 12:21:08 -08:00
Bram Wasti
cb046f7bd2
[static runtime] Initial memonger ( #47759 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47759
Parity reached :)
*/0 -> no memonger
*/1 -> memonger on
We can see that the impact is large when activations don't all fit in cache (6x speed up on this micro bench)
```
BM_long_static_memory_optimization/2/0 8563 ns 8559 ns 86370
BM_long_static_memory_optimization/8/0 8326 ns 8322 ns 84099
BM_long_static_memory_optimization/32/0 11446 ns 11440 ns 56107
BM_long_static_memory_optimization/512/0 6116629 ns 6113108 ns 128
BM_long_static_memory_optimization/2/1 8151 ns 8149 ns 87000
BM_long_static_memory_optimization/8/1 7905 ns 7902 ns 85124
BM_long_static_memory_optimization/32/1 10652 ns 10639 ns 66055
BM_long_static_memory_optimization/512/1 1101415 ns 1100673 ns 641
```
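The general idea behind this kind of memory planning can be sketched as liveness-based buffer reuse. The greedy planner below is a simplified illustration that ignores tensor sizes; it is not the algorithm in this diff, and `assign_buffers` and its input format are assumptions.

```python
def assign_buffers(lifetimes):
    """Map each value to a buffer id, reusing buffers of dead values.

    lifetimes: iterable of (name, first_use, last_use) step indices.
    """
    assignment = {}
    free = []      # buffer ids whose previous tenant is dead
    active = []    # (last_use, buffer_id) for values still live
    next_id = 0
    for name, start, end in sorted(lifetimes, key=lambda item: item[1]):
        still_live = []
        for last_use, buf in active:
            if last_use < start:
                free.append(buf)          # dead before this value is produced
            else:
                still_live.append((last_use, buf))
        active = still_live
        if free:
            buf = free.pop()
        else:
            buf = next_id
            next_id += 1
        assignment[name] = buf
        active.append((end, buf))
    return assignment


# A long chain where each activation is only read by the next op.
chain = [("t0", 0, 1), ("t1", 1, 2), ("t2", 2, 3), ("t3", 3, 4)]
plan = assign_buffers(chain)
```

For a long chain like the benchmark above, the working set shrinks to a couple of buffers, which is consistent with the large win once activations stop fitting in cache.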
TODO:
[x] implementation
[x] enable/disable flag
[x] statistics about memory saved
[x] additional models
Test Plan:
```
buck test //caffe2/test:static_runtime
buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test //caffe2/caffe2/fb/predictor:pytorch_predictor_test
```
Reviewed By: yinghai
Differential Revision: D24824445
fbshipit-source-id: db1f5239f72cbd1a9444017e20d5a107c3b3f043
2020-11-17 13:55:49 -08:00
Katy Voor
fe7d1d7d0e
Add LeakyReLU operator to static runtime ( #47798 )
...
Summary:
- Add LeakyReLU operator to static runtime
- Add LeakyReLU benchmark
- Add LeakyReLU correctness test case
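For reference, the elementwise function being added is small enough to state directly. This is a scalar sketch of the math only, not the static runtime kernel; the default slope of 0.01 matches `torch.nn.LeakyReLU`.

```python
def leaky_relu(x, negative_slope=0.01):
    """Identity for non-negative inputs, a scaled value for negative ones."""
    return x if x >= 0.0 else negative_slope * x


values = [leaky_relu(v, negative_slope=0.5) for v in (-2.0, 0.0, 3.0)]
```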
Static Runtime
```
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1 4092 ns 4092 ns 172331
BM_leaky_relu/8 4425 ns 4425 ns 158434
BM_leaky_relu/20 4830 ns 4830 ns 145335
BM_leaky_relu_const/1 3545 ns 3545 ns 198054
BM_leaky_relu_const/8 3825 ns 3825 ns 183074
BM_leaky_relu_const/20 4222 ns 4222 ns 165999
```
Interpreter
```
------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------
BM_leaky_relu/1 7183 ns 7182 ns 96377
BM_leaky_relu/8 7580 ns 7580 ns 91588
BM_leaky_relu/20 8066 ns 8066 ns 87183
BM_leaky_relu_const/1 6466 ns 6466 ns 107925
BM_leaky_relu_const/8 7063 ns 7063 ns 98768
BM_leaky_relu_const/20 7380 ns 7380 ns 94564
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47798
Reviewed By: ezyang
Differential Revision: D24927043
Pulled By: kavoor
fbshipit-source-id: 69b12cc57f725f1dc8d68635788813710a74dc2b
2020-11-13 22:05:52 -08:00
Yang Wang
0125e14c9a
[OpBench] change relu entry point after D24747035
...
Summary: D24747035 (1478e5ec2a ) removes the entry point of `nnq.functional.relu`. Adjust op benchmark to `torch.nn.ReLU` accordingly.
Test Plan: buck run caffe2/benchmarks/operator_benchmark/pt:qactivation_test -- --use_jit --iterations 1 --warmup_iterations 1
Reviewed By: mingzhe09088
Differential Revision: D24961625
fbshipit-source-id: 5ed0ec7fa6d8cfefc8e7fc8324cf9a2a3e59de90
2020-11-13 15:38:27 -08:00
Richard Zou
d4db4718fa
Revert D24873991: Profiler benchmark fix
...
Test Plan: revert-hammer
Differential Revision:
D24873991 (a97c7e2ef0 )
Original commit changeset: 1c3950d7d289
fbshipit-source-id: 6f3b8a49caf90aaa3e16707005b6b7cf6e61d89f
2020-11-13 08:37:14 -08:00
Yang Wang
9ee4f499f0
[OpBench] add _consume_op.list for processing input with type of List[Tensor] ( #47890 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47890
As titled. This fixes an issue when running `chunk_test`, `split_test`, `qobserver`, and `sort in qunary` in JIT mode: the output of `chunk_op` is a list of tensors, which the current `_consume_op` cannot handle.
Test Plan:
OSS:
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
Reviewed By: mingzhe09088
Differential Revision: D24774105
fbshipit-source-id: 210a0345b8526ebf3c24f4d0794e20b2ff6cef3d
2020-11-12 23:29:40 -08:00
Ilia Cherniavskii
a97c7e2ef0
Profiler benchmark fix ( #47713 )
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47713
Fix the import and also always use internal Timer
Test Plan: python benchmarks/profiler_benchmark/profiler_bench.py
Reviewed By: dzhulgakov
Differential Revision: D24873991
Pulled By: ilia-cher
fbshipit-source-id: 1c3950d7d289a4fb5bd7043ba2d842a35c263eaa
2020-11-12 21:47:30 -08:00