Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39962
Adds a simple ref-counted wrapper for the CUDA event and destroys the
CUDA event after the last copy is destroyed.
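A minimal Python sketch of the idea (hypothetical class names; the actual change lives in the C++ profiler code): copies share one event through an explicit reference count, and the event is only dropped when the last copy is released.
```
import torch

class _SharedCudaEvent:
    """Shared state: the CUDA event plus an explicit reference count."""
    def __init__(self):
        self.event = torch.cuda.Event(enable_timing=True)
        self.refcount = 0

class CudaEventHandle:
    """Hypothetical copyable handle; the underlying event is destroyed
    only when the last handle is released."""
    def __init__(self, shared=None):
        self._shared = shared or _SharedCudaEvent()
        self._shared.refcount += 1

    def copy(self):
        return CudaEventHandle(self._shared)

    def release(self):
        self._shared.refcount -= 1
        if self._shared.refcount == 0:
            # Last copy gone: drop the CUDA event.
            self._shared.event = None
```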
Test Plan: CI cuda profiler tests
Differential Revision: D22027092
Pulled By: ilia-cher
fbshipit-source-id: e0810388aa60b2291eb010896e13af1fad92e472
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39360
Makes the observer microbenchmarks also run on CUDA. This is useful
now that QAT is supported in DDP and is more likely to be run
on GPUs.
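A rough sketch of the kind of call being benchmarked (observer choice and sizes are illustrative, not taken from the benchmark config):
```
import torch
from torch.quantization import MinMaxObserver

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# The benchmark times the observer's forward pass on the chosen device.
obs = MinMaxObserver(dtype=torch.quint8).to(device)
x = torch.randn(32, 64, device=device)
obs(x)                                    # record min/max statistics
scale, zero_point = obs.calculate_qparams()
print(scale.item(), zero_point.item())
```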
Test Plan:
```
python -m pt.qobserver_test
```
Imported from OSS
Differential Revision: D21828985
fbshipit-source-id: 6da4d61f744f7a2ee5e87963b3ec84579128d435
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39483
I fixed all of the new errors that occurred because of the upgrade.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D21884575
Pulled By: ezyang
fbshipit-source-id: 45c8e1f1ecb410c8d7c46dd3922ad70e982a0685
Summary:
Otherwise, I don't understand how those could have been invoked.
Also, what is the benefit of importing the same module twice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38832
Differential Revision: D21675081
Pulled By: malfet
fbshipit-source-id: fee5604c4c433161b6b1a999d505b5acbbc3b421
Summary:
Together with https://github.com/pytorch/pytorch/issues/37758, this fixes https://github.com/pytorch/pytorch/issues/37743 and fixes https://github.com/pytorch/pytorch/issues/24861.
This follows the CUDA fix in https://github.com/pytorch/pytorch/issues/37758, vectorised using a `blendv` to replace the if conditionals.
Most of the complication is from `remainder` supporting `at::Half` whereas `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` and enabled half dispatch for `fmod` so it matches `remainder`.
I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could enable `BFloat16` for `remainder` as well if that's desirable; either way, I don't think `Vec256<BFloat16>` should be missing `fmod`.
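For reference, a small illustration of the semantic difference the kernels implement (the `blendv`-based select replaces the per-element sign check); half is used here since the change enables half dispatch for `fmod`:
```
import torch

a = torch.tensor([-3.0, 3.0], dtype=torch.half)
b = torch.tensor([2.0, -2.0], dtype=torch.half)

# fmod keeps the sign of the dividend; remainder keeps the sign of the divisor.
print(torch.fmod(a, b))       # tensor([-1.,  1.], dtype=torch.float16)
print(torch.remainder(a, b))  # tensor([ 1., -1.], dtype=torch.float16)
```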
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38293
Differential Revision: D21539801
Pulled By: ezyang
fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36291
Moves the profiler state to a thread-local property and reuses the existing
thread-local propagation mechanism to ensure correct profiling of async
tasks. This also makes the push/pop callbacks thread safe and easier to use
in e.g. the distributed profiler.
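A hedged sketch of what this enables (assuming `torch.jit.fork` from a recent release): work forked onto another thread still shows up under the parent profiling context, because the profiler state rides along with the existing thread-local propagation.
```
import torch

@torch.jit.script
def work(x):
    return x * 2

def main(x):
    fut = torch.jit.fork(work, x)   # runs asynchronously on another thread
    return torch.jit.wait(fut)

x = torch.randn(4, 4)
with torch.autograd.profiler.profile() as prof:
    main(x)
# Events from the forked task are attributed to this profiling context.
print(prof.key_averages().table(sort_by="cpu_time_total"))
```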
Test Plan:
USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install
./build/bin/test_jit
python test/test_autograd.py
python test/test_jit.py
Differential Revision: D20938501
Pulled By: ilia-cher
fbshipit-source-id: c0c6c3eddcfea8fc7c14229534b7246a0ad25845
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36847
Adds a quantized instancenorm operator, which can reuse most of
groupnorm's logic.
Benchmarking shows that the quantized version is about 10x faster than
floating point for equivalent input sizes
(https://gist.github.com/vkuzo/2f230e84d26f26cc6030afdbfbc8e7f0).
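The reuse works because instance norm is just group norm with one group per channel; a quick floating-point sanity check of that relationship (the quantized op itself is defined in this PR, so it is not called here):
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)
weight, bias = torch.randn(4), torch.randn(4)

# Instance norm normalizes each (N, C) plane, i.e. group norm with C groups.
out_in = F.instance_norm(x, weight=weight, bias=bias, eps=1e-5)
out_gn = F.group_norm(x, num_groups=4, weight=weight, bias=bias, eps=1e-5)
print(torch.allclose(out_in, out_gn, atol=1e-5))  # True
```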
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_instance_norm
```
Imported from OSS
Differential Revision: D21107925
fbshipit-source-id: 6bacda402f0eb9857bc8f9a5cf8ef306150613d4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36835
Adds a quantized groupnorm operator. We reuse most of the layernorm
kernel, modifying it to be able to perform channel-wise scaling.
Benchmark results: the quantized layer is between 6x and 15x faster than
the floating-point version, depending on input shapes
(full results:
https://gist.github.com/vkuzo/db67623232415382dabff6c8923124e9).
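A rough numeric check of the relationship being exploited: without the affine step, group norm over a single group reduces to layer norm over (C, H, W), and the per-channel weight/bias is exactly the channel-wise scaling the kernel was modified to apply.
```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)

# Without affine params, one-group group norm == layer norm over (C, H, W).
out_gn = F.group_norm(x, num_groups=1)
out_ln = F.layer_norm(x, normalized_shape=x.shape[1:])
print(torch.allclose(out_gn, out_ln, atol=1e-5))  # True

# Group norm's affine is per channel (shape [C]), broadcast over H and W --
# the channel-wise scaling added on top of the layernorm logic.
w, b = torch.randn(4), torch.randn(4)
out_affine = F.group_norm(x, num_groups=1, weight=w, bias=b)
expected = out_gn * w.view(1, 4, 1, 1) + b.view(1, 4, 1, 1)
print(torch.allclose(out_affine, expected, atol=1e-5))  # True
```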
Test Plan:
```
python test/quantization/test_quantized.py TestQuantizedOps.test_group_norm
python test/quantization/test_quantized.py TestQuantizedOps.test_qlayer_norm
```
Numerics are nearly equivalent, with the only difference documented in the
test case; it is the same kind of difference seen with quantized layernorm.
Making the numerics exactly equivalent is possible but would sacrifice
speed.
Imported from OSS
Differential Revision: D21107926
fbshipit-source-id: 80e87e9e2c71310bc28c3d114c88de428819cb45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36980
Missed this in the original diff, fixing: create the output tensor directly instead of quantizing it.
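A hedged sketch of the pattern (the actual kernel is C++; `torch._empty_affine_quantized` is a private allocator used here only for illustration): instead of computing a float result and then quantizing it, allocate the quantized output up front and let the int8 kernel write into it.
```
import torch

scale, zero_point = 0.1, 0

# Slower pattern: compute in float, then quantize (an extra pass over the data).
fp_out = torch.randn(64, 64)
q_slow = torch.quantize_per_tensor(fp_out, scale, zero_point, torch.quint8)

# Faster pattern: create the quantized output tensor directly.
q_fast = torch._empty_affine_quantized(
    (64, 64), scale=scale, zero_point=zero_point, dtype=torch.quint8)
print(q_fast.q_scale(), q_fast.q_zero_point())
```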
Test Plan:
tests still pass
microbenchmarks show a 2x performance improvement for int8:
https://gist.github.com/vkuzo/3b321b428e4c38e805000961c263286b (this
will depend on input size)
Imported from OSS
Differential Revision: D21185970
fbshipit-source-id: 5b9e93d9f9ac05a8120532bd03ad347541a132c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615
Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).
Test Plan: CI
Differential Revision: D20842886
Pulled By: dreiss
fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36674
Slight changes to the qlinear benchmark so it uses the same format as the
linear benchmark, for fairer comparisons between FP and quantized.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.linear_test
python -m pt.qlinear_test
```
Imported from OSS
Differential Revision: D21102562
fbshipit-source-id: 4f5c693b5de7e26c4326a9ec276560714290f6c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36673
Slight changes to the qconv benchmark to make it match the floating
point benchmark, so we can compare the two more easily.
Test Plan:
```
cd benchmarks/operator_benchmark
python -m pt.qconv_test --tag_filter all
python -m pt.conv_test --tag_filter all
```
Imported from OSS
Differential Revision: D21102563
fbshipit-source-id: d11c1e4c13d4c5fa1f2332c687aee6889c81b659
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198
The need for this tool was motivated by #28883. In the past, we have
done ad-hoc benchmarking, but it's time for something more structured.
It would be nice to add more model architectures so that we can get a
full picture of the performance impact of a code change simply by
running this suite a few times.
Test Plan: Imported from OSS
Differential Revision: D20591296
Pulled By: mrshenli
fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35731
Changes relu and relu6 to point to the functional implementations here.
The previous behavior tested the time to create the module, but didn't actually run the
function (I noticed this when adding the new input sizes and seeing that
the measured time did not change).
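A minimal sketch of the distinction (benchmark harness details omitted): timing `nn.ReLU()` alone measures module construction, whereas the fix times the activation itself, so the result scales with input size.
```
import timeit
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024)

# What the old benchmark effectively measured: constructing the module.
t_construct = timeit.timeit(lambda: torch.nn.ReLU(), number=1000)

# What the fixed benchmark measures: actually running the activation.
t_run = timeit.timeit(lambda: F.relu(x), number=1000)
print(f"construct: {t_construct:.4f}s  run: {t_run:.4f}s")
```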
Test Plan:
Run the benchmark; the time now changes as expected with input size for
these activations.
Imported from OSS
Differential Revision: D20875542
fbshipit-source-id: 3a6278a7a861437d613c1e30698a58175a8e8555
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35729
* adds benchmarks for a few quantized activations that had implementations but no benchmarks
* adds the input sizes from `unary_tests.py` here, so we can compare the fp and quantized implementations of activations fairly
Test Plan:
```
python -m pt.qactivation_test
```
Imported from OSS
Differential Revision: D20875544
fbshipit-source-id: f55a66422233b96f0791c85b05476596d5d72b5d
Summary:
Since the last one was apparently reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35530
Differential Revision: D20777341
Pulled By: ezyang
fbshipit-source-id: 6aaaf2a0755359074ae3d0efe32018d78dafe976
Summary:
This commit allows one to use an environment variable to enable the fuser in torch/csrc/jit/tensorexpr/
```
PYTORCH_TENSOREXPR=1 python benchmark.py
```
This commit also changes the registration to happen by default, removing the need to call the Python-exposed `_jit_register_tensorexpr_fuser`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35341
Reviewed By: ZolotukhinM
Differential Revision: D20676348
Pulled By: bwasti
fbshipit-source-id: 4c997cdc310e7567c03905ebff72b3e8a4c2f464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34820
Adds quantized version of hardswish, for common quantized operator coverage.
Note:
* we carry over scale and zero_point from the input to the output, because the
range of the output is unbounded for x > 0 (see the sketch after these notes)
* we also skip the .out function so the user cannot specify a custom
scale + zero_point (flexible on this).
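A hedged sketch of the scale/zero_point choice (a dequantize/compute/requantize reference, not the new kernel; assumes `torch.nn.functional.hardswish` from a recent release): because hardswish is unbounded for x > 0, the output simply reuses the input's quantization parameters.
```
import torch
import torch.nn.functional as F

x = torch.randn(16)
scale, zero_point = 0.05, 128
qx = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)

# Reference path: dequantize, apply hardswish, requantize with the *input*
# scale/zero_point, since the output range is unbounded for x > 0.
ref = F.hardswish(qx.dequantize())
q_ref = torch.quantize_per_tensor(ref, scale, zero_point, torch.quint8)
print(q_ref.q_scale(), q_ref.q_zero_point())  # same qparams as the input
```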
Test Plan:
```
python test/test_quantized.py
https://gist.github.com/vkuzo/f9b579315ed7f5fdb24839e3218d8465
```
Imported from OSS
Differential Revision: D20472905
fbshipit-source-id: 0f2a83e9f5f7b43485fa46caf30e756dc5d492a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34747
Adds the hardswish FP operator from MobileNetV3 to PyTorch. This is for
common operator coverage, since this is widely used. A future PR will
add the quantized version. CUDA is saved for a future PR as well.
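For reference, a small sketch of the function being added (the MobileNetV3 definition written with `relu6`; not the actual kernel, and the comparison to `F.hardswish` assumes a release that already ships it):
```
import torch
import torch.nn.functional as F

def hardswish_ref(x):
    # hardswish(x) = x * relu6(x + 3) / 6, per the MobileNetV3 paper.
    return x * F.relu6(x + 3.0) / 6.0

x = torch.randn(8)
print(torch.allclose(hardswish_ref(x), F.hardswish(x)))  # True
```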
Test Plan:
tests pass:
```
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardswish_cpu_float32
```
microbenchmark:
https://gist.github.com/vkuzo/b10d3b238f24e58c585314e8b5385aca
(batch_size == 1: 11.5GiB/s, batch_size == 4: 11.9GiB/s)
Imported from OSS
Differential Revision: D20451404
fbshipit-source-id: c7e13c9ab1a83e27a1ba18182947c82c896efae2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34959
Adds quantized implementation of hardsigmoid.
Original PR was https://github.com/pytorch/pytorch/pull/34607 and had to
be reverted for a test breakage, trying again.
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20514212
fbshipit-source-id: cc7ae3b67757e2dde5c313c05ce60a0f2625d961
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34607
Adds quantized version of hardsigmoid activation.
Note: not implementing the `_` and `.out` variants is currently intentional,
because the implementation changes the scale and zero_point, and it's nice
not to let the user specify them. Let me know if we should handle this
differently.
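A hedged sketch of why the output qparams differ from the input's: hardsigmoid lands in [0, 1], so the implementation can pick qparams covering exactly that range (the scale=1/256, zero_point=0 below is an assumption for illustration, not necessarily what the kernel uses):
```
import torch
import torch.nn.functional as F

x = torch.randn(16)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=128, dtype=torch.quint8)

# hardsigmoid(x) = relu6(x + 3) / 6 lands in [0, 1], so the output can use
# qparams covering [0, 1] (assumed here: scale=1/256, zero_point=0) instead
# of the input's scale/zero_point.
ref = F.hardsigmoid(qx.dequantize())
q_ref = torch.quantize_per_tensor(ref, 1.0 / 256.0, 0, torch.quint8)
print(q_ref.q_scale(), q_ref.q_zero_point())
```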
Test Plan:
tests
benchmarks
Imported from OSS
Differential Revision: D20480546
fbshipit-source-id: 9febcb44afd920125ed2ca4900492f0b712078ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33719
We were seeing a strange error where gathering profiler events (specifically `parse_cpu_trace` in `profiler.py`) would fail with the error:
`IndexError: pop from empty list`.
It turned out that this was because for one particular `Event`, there was a pop recorded but not a push. Instead of the `push` event being completely missing, it was overwritten by a completely different event.
After a bunch of debugging and trying several hypotheses, it turned out that this was a race condition in `RangeEventList::record`: different threads would call into `RangeEventList::record` on the same event list instance, and one record would stomp over the data written by the other. Somehow the data written was a valid `Event`, so the error did not manifest itself until the profiler realized a `pop` was missing a matching `push` in the Python code.
I fixed this by adding a lock to serialize writes to `RangeEventList::record`.
This PR also makes a small change to pass in the `RecordFunction` name into `popRange`. It makes the debugging easier when investigating the events recorded.
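A Python analogue of the fix (the real change adds a mutex inside the C++ `RangeEventList::record`): serialize concurrent writers so one thread's event cannot stomp over another's.
```
import threading

class EventList:
    """Toy stand-in for RangeEventList."""

    def __init__(self):
        self._events = []
        self._lock = threading.Lock()

    def record(self, kind, name):
        # Serialize writers, mirroring the lock added to the C++ record().
        with self._lock:
            self._events.append((kind, name))

events = EventList()
threads = [threading.Thread(target=events.record, args=("push", f"op{i}"))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(events._events))  # 8
```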
Differential Revision: D20071125
fbshipit-source-id: 70b51a65bcb833a7c88b7462a978fd3a39265f7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34545
This is for common operator coverage, since this is widely used. A future PR
will add the quantized version.
Some initial questions for reviewers, since it's my first FP operator
diff:
* do we need a `backwards.out` method for this?
* do we need CUDA? If yes, should it be in this PR or is it ok to split it into a follow-up?
Test Plan:
```
// test
python test/test_torch.py TestTorchDeviceTypeCPU.test_hardsigmoid_cpu_float32
// benchmark
python -m pt.hardsigmoid_test
...
Forward Execution Time (us) : 40.315
Forward Execution Time (us) : 42.603
```
Imported from OSS
Differential Revision: D20371692
fbshipit-source-id: 95668400da9577fd1002ce3f76b9777c6f96c327
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34230
This PR adds some benchmarks that we used to assess tensor expression performance.
Differential Revision: D20251830
Test Plan: Imported from OSS
Pulled By: ZolotukhinM
fbshipit-source-id: bafd66ce32f63077e3733112d854f5c750d5b1af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34267
Adds quantized ELU.
Test Plan:
```
python test/test_quantized.py TestQuantizedOps.test_qelu
```
Still need to benchmark; saving that for after the review comments.
Imported from OSS
Differential Revision: D20370953
fbshipit-source-id: fe941bf966f72dd9eee2c4b2ef45fe7afb50c866