Commit Graph

244 Commits

Author SHA1 Message Date
Ilia Cherniavskii
d8c384544e Destroy CUDA events after profiling (#39962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962

Adds a simple ref-counted wrapper for CUDA events and destroys the
underlying CUDA event after the last copy of the wrapper is destroyed.
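
A minimal Python sketch of the ref-counting idea (the actual change is a C++ wrapper inside the profiler; the class and method names here are illustrative):

```
import torch

class SharedCudaEvent:
    # Illustrative ref-counted handle: the CUDA event is dropped only
    # after the last copy is released, mirroring the C++ wrapper.
    def __init__(self, state=None):
        # state is shared between all copies: [event, refcount]
        if state is None:
            state = [torch.cuda.Event(enable_timing=True), 1]  # needs a CUDA device
        self._state = state

    def copy(self):
        self._state[1] += 1
        return SharedCudaEvent(self._state)

    def release(self):
        self._state[1] -= 1
        if self._state[1] == 0:
            # Last copy gone: drop the event so it can be destroyed.
            self._state[0] = None
```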

Test Plan: CI CUDA profiler tests

Differential Revision: D22027092

Pulled By: ilia-cher

fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
2020-06-23 10:44:39 -07:00
Wojciech Baranowski
43331609a4 Port addmm, addbmm, addr to ATen (CUDA) (#38421)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24536, fixes https://github.com/pytorch/pytorch/issues/24534 and fixes https://github.com/pytorch/pytorch/issues/24533
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38421

Differential Revision: D22138333

Pulled By: VitalyFedyunin

fbshipit-source-id: f4411d0df0a001bbb95089eb55fdcac3aba86700
2020-06-22 13:02:33 -07:00
Vasiliy Kuznetsov
e35199a691 observer bench: add CUDA (#39360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360

Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
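
A rough sketch of what a CUDA observer measurement looks like (MinMaxObserver as an example; the actual harness lives in pt.qobserver_test, and observer support for CUDA tensors is what this change exercises):

```
import torch
from torch.quantization import MinMaxObserver

obs = MinMaxObserver().to("cuda")          # observers are nn.Modules
x = torch.randn(1024, 1024, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
obs(x)                                     # updates running min/max on the GPU
end.record()
torch.cuda.synchronize()
print("observer forward: %.3f ms" % start.elapsed_time(end))
```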

Test Plan:
```
python -m pt.qobserver_test
```

Imported from OSS

Differential Revision: D21828985

fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
2020-06-05 14:18:32 -07:00
Edward Yang
da2004e132 Upgrade lint. (#39483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483

I fixed all of the new errors that occurred because of the upgrade.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21884575

Pulled By: ezyang

fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
2020-06-04 12:56:43 -07:00
Michael Voznesensky
fce01a9bab [JIT] Make new zip serialization for torch save/load significantly (~70%) faster (#38379)
Summary:
Before:
```
2020-05-11 18:31:41 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 17.8048762,
    "median": 17.458917
  },
  "Big Tensors Load": {
    "mean": 3.2556887,
    "median": 2.9668495000000004
  },
  "Small Tensors Save": {
    "mean": 4.0381357,
    "median": 3.9440125
  },
  "Small Tensors Load": {
    "mean": 5.8792499,
    "median": 5.603067
  },
  "benchmark_run_at": "2020-05-12T01:31:41"
}
```
After:
```
Use zipfile serialization: True
2020-05-12 20:15:32 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 4.7534657,
    "median": 4.646732
  },
  "Big Tensors Load": {
    "mean": 3.6001919,
    "median": 3.493285
  },
  "Small Tensors Save": {
    "mean": 4.1066924,
    "median": 4.1219255
  },
  "Small Tensors Load": {
    "mean": 6.3902358,
    "median": 6.36977
  },
  "benchmark_run_at": "2020-05-13T03:15:32"
}
```
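
For reference, the zipfile format is toggled by the `_use_new_zipfile_serialization` flag on `torch.save`; a minimal timing sketch (the actual benchmark harness is more elaborate):

```
import time
import torch

tensors = [torch.randn(1000, 1000) for _ in range(10)]

t0 = time.perf_counter()
torch.save(tensors, "big.pt", _use_new_zipfile_serialization=True)
print("save: %.3fs" % (time.perf_counter() - t0))

t0 = time.perf_counter()
loaded = torch.load("big.pt")
print("load: %.3fs" % (time.perf_counter() - t0))
```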
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38379

Differential Revision: D21779494

Pulled By: voznesenskym

fbshipit-source-id: 694d65029a5b817424d454bd331e285df828c67a
2020-05-29 01:56:18 -07:00
Nikita Shulga
c02e7c464a Replace import cpp_benchmark with torch.utils.cpp_benchmark (#38832)
Summary:
Otherwise, I don't understand how those could have been invoked.

Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832

Differential Revision: D21675081

Pulled By: malfet

fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
2020-05-20 18:53:09 -07:00
Ilia Cherniavskii
a94fb71b12 Memory profiling (#37775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3
2020-05-19 15:48:48 -07:00
Peter Bell
0a159b0a3a Fix precision issues in CPU remainder (#38293)
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.

This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.

Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.

I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
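
The scalar logic being vectorized is `fmod` plus a sign correction; a `torch.where` analogue of the `blendv` select (an illustrative reference, not the actual kernel):

```
import torch

def remainder_ref(a, b):
    # r = fmod(a, b); where r is nonzero and differs in sign from b,
    # fold it back into b's interval. This branch is what the kernels
    # replace with a vectorized blendv.
    r = torch.fmod(a, b)
    needs_fix = (r != 0) & ((r < 0) != (b < 0))
    return torch.where(needs_fix, r + b, r)

a = torch.tensor([5.0, -5.0, 5.0, -5.0])
b = torch.tensor([3.0, 3.0, -3.0, -3.0])
assert torch.allclose(remainder_ref(a, b), torch.remainder(a, b))
```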
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293

Differential Revision: D21539801

Pulled By: ezyang

fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
2020-05-14 08:54:32 -07:00
Supriya Rao
ae11718c45 [quant] Add quantized::conv1d op benchmark (#38332)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38332

Test Plan:
python -m pt.qconv_test --test QConv1d_N1_IC128_OC256_L64_G1_kernel3_stride1_pad0
Forward Execution Time (us) : 147.844

python -m pt.conv_test --test Conv1d_IC128_OC256_kernel3_stride1_N1_L64_cpu
Forward Execution Time (us) : 470.750

Imported from OSS

Differential Revision: D21553662

fbshipit-source-id: 9c240a141f9cd3a82a20aa462e8e5577e002a387
2020-05-13 16:59:19 -07:00
Mikhail Zolotukhin
9a2d8dfe63 [TensorExpr] Benchmarks: set up profiling executor and fuser according to the given arguments. (#38295)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38295

Test Plan: Imported from OSS

Differential Revision: D21525741

Pulled By: ZolotukhinM

fbshipit-source-id: 8bf1d54da062c8e0653bb2cb627883ae4ed14774
2020-05-12 23:27:46 -07:00
Ilia Cherniavskii
facc5e0cc4 Make profiler thread local (#36291)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291

Moves the profiler state to a thread-local property and
reuses the existing thread-local propagation mechanism to ensure
correct profiling of async tasks. This also makes the
push/pop callbacks thread safe and easier to use in e.g. the
distributed profiler.
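
A conceptual Python sketch of thread-local state plus explicit propagation into async work (the real mechanism is in C++; all names here are illustrative):

```
import threading

_profiler_state = threading.local()

def profiler_enabled():
    return getattr(_profiler_state, "enabled", False)

def set_profiler_enabled(enabled):
    _profiler_state.enabled = enabled

def wrap_async_task(fn):
    # Capture the caller's profiler state and re-apply it in the
    # worker thread, so async tasks are profiled consistently.
    captured = profiler_enabled()
    def task(*args, **kwargs):
        set_profiler_enabled(captured)
        return fn(*args, **kwargs)
    return task
```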

Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py

Differential Revision: D20938501

Pulled By: ilia-cher

fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
2020-05-07 14:52:49 -07:00
Vasiliy Kuznetsov
4fa049c525 add quantized instancenorm operator (#36847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847

Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.

Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```

Imported from OSS

Differential Revision: D21107925

fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
2020-05-06 19:01:33 -07:00
Vasiliy Kuznetsov
b837d5d418 add quantized groupnorm operator (#36835)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835

Adds a quantized groupnorm operator.  We reuse most of the layernorm
kernel, modifying it to perform channel-wise scaling.

Benchmark results: the quantized layer is between 6x and 15x faster
than the floating point version, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9)

Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```

Numerics are nearly equivalent, with the only difference documented
in the test case.  The difference is of the same kind as in quantized
layernorm.  Making the numerics exactly equivalent is possible but would
sacrifice speed.

Imported from OSS

Differential Revision: D21107926

fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
2020-05-06 19:01:26 -07:00
Vasiliy Kuznetsov
2773ed3082 hardswish: remove unnecessary quantize call (#36980)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980

Missed this in the original diff; fixing.  Creates the output tensor directly instead of quantizing it.

Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)

Imported from OSS

Differential Revision: D21185970

fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
2020-04-22 16:15:54 -07:00
David Reiss
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace changes might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
Vasiliy Kuznetsov
13391cebe2 ai-pep: match the qlinear benchmark to linear (#36674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674

Slight changes to the qlinear benchmark to put it in the same format
as linear, for fairer comparisons between FP and Q.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```

Imported from OSS

Differential Revision: D21102562

fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
2020-04-20 09:46:32 -07:00
Vasiliy Kuznetsov
25649684ed ai-pep: align qconv benchmark to conv (#36673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673

Slight changes to the qconv benchmark to make it match the floating
point benchmark, so the two can be compared more easily.

Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```

Imported from OSS

Differential Revision: D21102563

fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
2020-04-20 09:44:09 -07:00
Vasiliy Kuznetsov
a5d0d762fa redo of add quantized layer norm implementation (#36593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36593

This is a redo of https://github.com/pytorch/pytorch/pull/35329 with a
better test.

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b
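
The int-domain trick in v3 rests on quantization being affine: for x = s * (q - z), mean(x) = s * (mean(q) - z) and var(x) = s^2 * var(q), so mean and variance can be computed on the raw integers and rescaled afterwards. A sketch checking the algebra (the kernel itself uses integer accumulators):

```
import torch

x = torch.randn(8, 64)
scale, zp = 0.05, 10
q = torch.quantize_per_tensor(x, scale, zp, torch.quint8)
ints = q.int_repr().float()  # raw quantized values

mean_x = scale * (ints.mean(dim=1) - zp)
var_x = (scale ** 2) * ints.var(dim=1, unbiased=False)

deq = q.dequantize()
assert torch.allclose(mean_x, deq.mean(dim=1), atol=1e-4)
assert torch.allclose(var_x, deq.var(dim=1, unbiased=False), atol=1e-4)
```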

Differential Revision: D21030268

Pulled By: vkuzo

fbshipit-source-id: b3594c3393cfce37a881319e2e0560620d51080f
2020-04-15 19:47:18 -07:00
Supriya Rao
73f11a0b23 Update qbatch_norm2d opbenchmark test (#36630)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36630

Test Plan:
OMP_NUM_THREADS=1 python -m pt.qbatchnorm_test

Imported from OSS

Differential Revision: D21030508

fbshipit-source-id: 1ece1bd7429207732eae4dd1982ceddcdc5d3a91
2020-04-14 17:09:18 -07:00
Hameer Abbasi
7c825bad10 [RELAND] Add __torch_function__ benchmarks (#36138)
Summary:
Re-land of https://github.com/pytorch/pytorch/issues/35530 and https://github.com/pytorch/pytorch/issues/34645
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36138

Differential Revision: D20893770

Pulled By: ezyang

fbshipit-source-id: 75ab688a086f5fb87412a853df5246c0c39704ca
2020-04-10 09:14:31 -07:00
Edward Yang
88c22070fe Revert D20768930: add quantized layer norm implementation
Test Plan: revert-hammer

Differential Revision:
D20768930

Original commit changeset: ddf8727e9840

fbshipit-source-id: a190e1d1e42281eba627b0dbb6de1b3651cd5e97
2020-04-09 14:36:37 -07:00
Vasiliy Kuznetsov
f813e7184e add quantized layer norm implementation (#35329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35329

Adds a quantized implementation of LayerNorm for server.

A future PR will add the Python wrapper.

Test Plan:
numerics match the floating point implementation

benchmarks by input size:
v1 (mean+var non-vectorized): https://gist.github.com/vkuzo/f6d72c04742608112f4c2e612c74bd13
v2 (mean+var vectorized in float): https://gist.github.com/vkuzo/4dd95657c5b5f3654e0965db00eff8d2
v3 (mean+var vectorized in int, current): https://gist.github.com/vkuzo/57a75f75629da9f23b64b38ca0e3d34b

Imported from OSS

Differential Revision: D20768930

fbshipit-source-id: ddf8727e9840c65ead3b890220af0638c5637028
2020-04-09 09:11:41 -07:00
Shen Li
76c7652cc5 Add distributed data parallel benchmark tool (#35198)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198

The need for this tool was motivated by #28883. In the past, we have
done ad-hoc benchmarking, but it's time for something more structured.

It would be nice to add more model architectures so that we can get a
full picture of the performance impact of a code change simply by
running this suite a few times.

Test Plan: Imported from OSS

Differential Revision: D20591296

Pulled By: mrshenli

fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49
2020-04-08 15:07:03 -07:00
Vasiliy Kuznetsov
cc78914755 qactivation_benchmarks: small bug fix (#35731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731

Changes relu and relu6 to point to the functional implementations here.
The previous code measured the time to create the module but didn't actually run the
function (I noticed this when adding the new input sizes and seeing
the measured time not change).
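
The bug pattern in miniature: timing `nn.ReLU()` only constructs the module, while the functional form actually runs the op (a hypothetical stand-alone repro, not the benchmark code itself):

```
import timeit
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1024, 1024)

def buggy():
    nn.ReLU()        # constructs a module; input size is irrelevant

def fixed():
    F.relu(x)        # actually executes the activation on x

print(timeit.timeit(buggy, number=1000))
print(timeit.timeit(fixed, number=1000))
```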

Test Plan:
Run the benchmark; the time now changes as expected with input size
for these ops.

Imported from OSS

Differential Revision: D20875542

fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
2020-04-06 15:02:33 -07:00
Vasiliy Kuznetsov
6405f26a02 add more quantized activation benchmarks and input sizes (#35729)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729

* a few quantized activations had implementations but no benchmarks; this adds them
* adds the input sizes from `unary_tests.py` here, so we can compare FP and quantized implementations of activations fairly

Test Plan:
```
python -m pt.qactivation_test
```

Imported from OSS

Differential Revision: D20875544

fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
2020-04-06 15:02:29 -07:00
Vasiliy Kuznetsov
b68c3827de add benchmark for quantized batchnorm (#35389)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35389

Adds a benchmark for quantized batchnorm, with the same parameters
as the floating point batchnorm benchmark.

Test Plan:
run benchmarks
https://gist.github.com/vkuzo/c49be58abdf0ff64797fab3936d0cb15

Imported from OSS

Differential Revision: D20875543

fbshipit-source-id: ced89fbe2d18168e92950d0b74ca638aba54cd96
2020-04-06 15:01:05 -07:00
Mikhail Zolotukhin
9fe3b1857d [TensorExpr] Fix imports in tensorexpr benchmarks. (#35830)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35830

Test Plan: Imported from OSS

Differential Revision: D20799464

Pulled By: ZolotukhinM

fbshipit-source-id: 1b5981ad15042f601a9b6eb01a799cdf71200666
2020-04-01 14:23:33 -07:00
Michael Suo
6491bf2855 Revert D20777341: [pytorch][PR] Add __torch_function__ benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20777341

Original commit changeset: 6aaaf2a07553

fbshipit-source-id: 1c324f91f85ac624bf878297c96c682a46958954
2020-04-01 10:23:00 -07:00
Hameer Abbasi
8c534bb0bd Add __torch_function__ benchmarks. (#35530)
Summary:
Since the last one was apparently reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35530

Differential Revision: D20777341

Pulled By: ezyang

fbshipit-source-id: 6aaaf2a0755359074ae3d0efe32018d78dafe976
2020-04-01 06:30:17 -07:00
Bram Wasti
a3e10d2a17 Expose enablement of TensorExpr fuser as env variable (#35341)
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/:

```
PYTORCH_TENSOREXPR=1 python benchmark.py
```

This commit also changes the registration to happen by default, removing the need for the Python-exposed "_jit_register_tensorexpr_fuser".
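
From Python, the flag behaves like an ordinary environment variable read by the runtime; a hedged sketch (setting it before importing torch is the safe way, and the shell form above remains canonical):

```
import os
os.environ.setdefault("PYTORCH_TENSOREXPR", "1")  # set before torch is imported

import torch

@torch.jit.script
def f(a, b):
    return a * b + b

# With the fuser enabled, pointwise chains like this are candidates
# for fusion into a single TensorExpr kernel.
print(f(torch.randn(1024), torch.randn(1024)).shape)
```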
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341

Reviewed By: ZolotukhinM

Differential Revision: D20676348

Pulled By: bwasti

fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
2020-03-26 14:31:57 -07:00
Alban Desmaison
4d39aeec27 Revert D20653072: [pytorch][PR] Add __torch_function__ benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20653072

Original commit changeset: e7e363f8a1b8

fbshipit-source-id: e75e4979399d6fee10e00a673ea45b9bcc0fd447
2020-03-26 13:36:59 -07:00
Hameer Abbasi
bf24753570 Add __torch_function__ benchmarks. (#34645)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34645

Differential Revision: D20653072

Pulled By: ezyang

fbshipit-source-id: e7e363f8a1b84fc0c354586e266a695e4a2ea60e
2020-03-26 11:29:10 -07:00
Vasiliy Kuznetsov
f1efe51028 add quantized version of hardswish operator (#34820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34820

Adds quantized version of hardswish, for common quantized operator coverage.

Note:
* we carry over scale and zero_point from the input to the output, because the
  range of the output is unbounded if x > 0
* we also skip the .out function to not allow the user to specify a custom
  scale+zp (flexible on this).

Test Plan:
```
python test/test_quantized.py

https://gist.github.com/vkuzo/f9b579315ed7f5fdb24839e3218d8465
```

Imported from OSS

Differential Revision: D20472905

fbshipit-source-id: 0f2a83e9f5f7b43485fa46caf30e756dc5d492a9
2020-03-24 15:16:58 -07:00
Vasiliy Kuznetsov
f3e9fa6122 add hardswish FP operator (#34747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747

Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used.  A future PR will
add the quantized version.  CUDA is saved for a future PR as well.

Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```

microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)

Imported from OSS

Differential Revision: D20451404

fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
2020-03-24 15:15:34 -07:00
Mikhail Zolotukhin
8998a1b3d3 Add tensorexpr benchmarks. (#35064)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35064

Test Plan: Imported from OSS

Differential Revision: D20543695

Pulled By: ZolotukhinM

fbshipit-source-id: 1cf294ab19465cb93557c2b195252c739b40a0f7
2020-03-20 12:01:31 -07:00
Vasiliy Kuznetsov
bf41a7624e fix missing comma in activation benchmarks (#35104)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35104

I missed this in https://github.com/pytorch/pytorch/pull/34959
after a rebase, fixing.

Test Plan:
running benchmarks no longer crashes
CI

Imported from OSS

Differential Revision: D20560908

fbshipit-source-id: a5494e23953d3c9007e9874d673896291b5322e0
2020-03-20 11:36:05 -07:00
Vasiliy Kuznetsov
37b234a880 quantized hardsigmoid, take 2 (#34959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34959

Adds quantized implementation of hardsigmoid.

Original PR was https://github.com/pytorch/pytorch/pull/34607 and had to
be reverted for a test breakage, trying again.

Test Plan:
tests
benchmarks

Imported from OSS

Differential Revision: D20514212

fbshipit-source-id: cc7ae3b67757e2dde5c313c05ce60a0f2625d961
2020-03-19 13:27:22 -07:00
Shen Li
95f1cb34b9 Revert D20480546: adds quantized implementation of hard sigmoid
Test Plan: revert-hammer

Differential Revision:
D20480546

Original commit changeset: 9febcb44afd9

fbshipit-source-id: 4461b455e63448cf45237e23c988b492c3e0f1b0
2020-03-17 19:58:08 -07:00
Vasiliy Kuznetsov
58c5b6d306 adds quantized implementation of hard sigmoid (#34607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34607

Adds quantized version of hardsigmoid activation.

Note: the _ and .out versions are intentionally not implemented,
because the implementation changes the scale and
zero_point, and it's nice to not allow the user to specify the scale
and zero_point.  Let me know if we should handle this differently.

Test Plan:
tests
benchmarks

Imported from OSS

Differential Revision: D20480546

fbshipit-source-id: 9febcb44afd920125ed2ca4900492f0b712078ea
2020-03-17 16:01:39 -07:00
Rohan Varma
1e140c353c [profiler][rpc] fix a race condition in the profiler when multiple threads call (#33719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33719

We were seeing a strange error where gathering profiler events (specifically `parse_cpu_trace` in `profiler.py`) would fail with the error:
`IndexError: pop from empty list`.

It turned out that this was because for one particular `Event`, there was a pop recorded but not a push. Instead of the `push` event being completely missing, it was overwritten by a completely different event.

After a bunch of debugging, and trying several hypotheses, it turns out that this was a race condition in `RangeEventList::record`. What happened was that different threads would call into `RangeEventList::record` on the same event list instance, and one record would stomp over the data written by the other one. Somehow the data written was a valid `Event` so the error did not manifest itself until the profiler realized a `pop` was missing a matching `push` in the python code.

I fixed this by adding a lock to serialize writes to `RangeEventList::record`.

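The fix in Python terms: a mutex serializes writes to the shared event list so concurrent push/pop records can't overwrite each other (an illustrative analogue of the C++ change):

```
import threading

class RangeEventList:
    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def record(self, kind, name):
        # Without the lock, two threads recording at once could stomp
        # on each other's entries, leaving a pop with no matching push.
        with self._lock:
            self._events.append((kind, name))
```
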
This PR also makes a small change to pass in the `RecordFunction` name into `popRange`. It makes the debugging easier when investigating the events recorded.

Differential Revision: D20071125

fbshipit-source-id: 70b51a65bcb833a7c88b7462a978fd3a39265f7e
2020-03-16 18:41:16 -07:00
Vasiliy Kuznetsov
1bac5fd0d3 add hardsigmoid FP operator to PyTorch (#34545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545

This is for common operator coverage, since this is widely used.  A future PR
will add the quantized version.
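
For reference, hardsigmoid here is the piecewise-linear approximation relu6(x + 3) / 6; a self-check (assuming a build with this change, where F.hardsigmoid exists):

```
import torch
import torch.nn.functional as F

def hardsigmoid_ref(x):
    # hardsigmoid(x) = clamp((x + 3) / 6, 0, 1)
    return torch.clamp((x + 3.0) / 6.0, 0.0, 1.0)

x = torch.randn(10000)
assert torch.allclose(F.hardsigmoid(x), hardsigmoid_ref(x))
```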

Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a backwards.out method for this?
* do we need CUDA? If yes, should it be in this PR, or is it OK to split it out?

Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32

// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315

Forward Execution Time (us) : 42.603
```

Imported from OSS

Differential Revision: D20371692

fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
2020-03-16 15:24:12 -07:00
Mikhail Zolotukhin
976d6aaa51 Revert D20251830: [TensorExpr] Add tensorexpr benchmarks.
Test Plan: revert-hammer

Differential Revision:
D20251830

Original commit changeset: bafd66ce32f6

fbshipit-source-id: d8aea4b26441d8aba90c11d7350d3424df494052
2020-03-16 13:20:16 -07:00
Mikhail Zolotukhin
e93e7b2795 [TensorExpr] Add tensorexpr benchmarks. (#34230)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34230

This PR adds some benchmarks that we used to assess tensor expression performance.

Differential Revision: D20251830

Test Plan: Imported from OSS

Pulled By: ZolotukhinM

fbshipit-source-id: bafd66ce32f63077e3733112d854f5c750d5b1af
2020-03-16 11:49:39 -07:00
Vasiliy Kuznetsov
43c9cc7a9c add quantized ELU activation (#34267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34267

Adds quantized ELU.

Test Plan:
```
python test/test_quantized.py TestQuantizedOps.test_qelu
```

Still need to benchmark; saving that for after the review comments.

Imported from OSS

Differential Revision: D20370953

fbshipit-source-id: fe941bf966f72dd9eee2c4b2ef45fe7afb50c866
2020-03-12 09:31:00 -07:00
Vasiliy Kuznetsov
2e88a78d2e add quantized_hardtanh (#34097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34097

Adds quantized hardtanh.  Calls the clamp kernel behind the
scenes.
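
hardtanh is a clamp; the quantized path runs the same clamp kernel on the underlying integer values. A float-side identity check:

```
import torch
import torch.nn.functional as F

x = torch.randn(10000)
# hardtanh(x, min_val, max_val) == clamp(x, min_val, max_val)
assert torch.equal(F.hardtanh(x, -2.0, 2.0), torch.clamp(x, -2.0, 2.0))
```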

Test Plan:
```
python test/test_quantized.py
```

Imported from OSS

Differential Revision: D20208860

fbshipit-source-id: 165a6a1c22f1dcc479679e5ea0c990d0e9c3b6c5
2020-03-10 22:27:15 -07:00
Wojciech Baranowski
b10a39bb32 Migrate _cat from TH to ATen (CUDA) (#33237)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24520

Benchmarks:

Upstream:

```
$ python -m pt.cat_test --tag_filter all --device cuda  --omp_num_threads 1 --mkl_num_threads 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 17.355

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 30.718

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 17.329

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 30.176

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 74.417

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 75.728

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 190.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa8876fcf28>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa8876fcf28>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 57.711

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7fa886237048>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7fa886237048>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 49.903

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7fa7b57bb840>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7fa7b57bb840>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 84.181

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bba60>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bba60>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 82.339

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7fa7b57bbae8>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7fa7b57bbae8>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 82.312

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7fa7b57bbb70>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7fa7b57bbb70>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 90.715

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 129.021

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 142.966

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 387.023

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbbf8>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbbf8>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 36.647

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbc80>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbc80>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 278.890

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbd08>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbd08>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 557.752

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7fa7b57bbd90>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7fa7b57bbd90>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 842.512

```

New version:

```
$ python -m pt.cat_test --tag_filter all --device cuda  --omp_num_threads 1 --mkl_num_threads 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 24.419

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 25.025

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 24.247

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 25.098

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 74.441

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 74.866

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 189.280

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1c9b056048>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1c9b056048>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 57.629

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f1c9b0560d0>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f1c9b0560d0>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 49.975

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f1bce8f38c8>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f1bce8f38c8>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 83.643

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3ae8>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3ae8>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 82.307

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f1bce8f3b70>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f1bce8f3b70>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 82.323

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f1bce8f3bf8>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f1bce8f3bf8>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 90.549

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 129.022

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 142.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 386.973

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3c80>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3c80>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 43.800

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3d08>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3d08>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 279.023

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3d90>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3d90>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 565.790

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f1bce8f3e18>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f1bce8f3e18>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 845.153
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33237

Differential Revision: D20069181

Pulled By: ngimel

fbshipit-source-id: b392e1ffd72c0d8df0c5a2d3ac96f59b37c84e32
2020-02-24 17:41:16 -08:00
comet
9a2691f2fc Fix spelling errors
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32673

Differential Revision: D19597118

Pulled By: pietern

fbshipit-source-id: f88c1da7548fcee141ed248f5f49d25c1d639955
2020-01-28 04:46:15 -08:00
Huamin Li
52f8f031ac add diag into pt operator microbenchmark (#32597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32597

Currently there is no benchmark for the diag operator. This diff adds one to the suite.
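
For context on the benchmark's `dim`/`diagonal` parameters: `torch.diag` builds a matrix from a 1-D input and extracts a diagonal from a 2-D input:

```
import torch

v = torch.arange(3.0)             # dim 1: vector -> diagonal matrix
m = torch.randn(128, 128)         # dim 2: matrix -> diagonal vector

print(torch.diag(v))              # 3x3 matrix with v on the main diagonal
print(torch.diag(m, -10).shape)   # 10th sub-diagonal, length 118
```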

Test Plan:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim1_M64_N64_diagonal0_outTrue_cpu
# Input: dim: 1, M: 64, N: 64, diagonal: 0, out: True, device: cpu
Forward Execution Time (us) : 28.496

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim2_M128_N128_diagonal-10_outFalse_cpu
# Input: dim: 2, M: 128, N: 128, diagonal: -10, out: False, device: cpu
Forward Execution Time (us) : 45.179

# Benchmarking PyTorch: diag
# Mode: Eager
# Name: diag_dim1_M256_N256_diagonal20_outTrue_cpu
# Input: dim: 1, M: 256, N: 256, diagonal: 20, out: True, device: cpu
Forward Execution Time (us) : 49.009
```

Reviewed By: mingzhe09088

Differential Revision: D19564024

fbshipit-source-id: 828a3e0e0e06810a77eb5ddb734efd30e4a63acf
2020-01-24 15:41:04 -08:00
Brian Wignall
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
Zafar Takhirov
0ae063d5d9 Fixed concatenation benchmark + added it to the microbenchmarking runs
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31587

Test Plan: Imported from OSS

Differential Revision: D19221813

Pulled By: z-a-f

fbshipit-source-id: ee0eb60da7899b23fdc63326302d1e2fd4b540ee
2020-01-03 11:23:12 -08:00