pytorch/torch/quantization
Supriya Rao 434af5d94a [quant] Speed up per-channel min-max observer (#34118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34118

Previously, calc_per_channel_qparams used Python for loops and primitives, which called `item()` many times and caused slowdowns during training.
These changes use torch primitives on the tensor, speeding up the operation by over 60x.
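As an illustration of the approach (a minimal sketch only; the function name and channel-axis handling are assumptions, not the exact observer.py code), per-channel min/max can be computed with one tensor reduction per call instead of per-element Python work:

    import torch

    def per_channel_min_max(x, ch_axis=0):
        # Move the channel axis to the front and flatten the remaining dims,
        # so each row holds every value belonging to one channel.
        x = x.transpose(ch_axis, 0)
        x = x.reshape(x.size(0), -1)
        # One reduction kernel per tensor instead of one `.item()` call per element.
        min_vals = torch.min(x, dim=1)[0]
        max_vals = torch.max(x, dim=1)[0]
        return min_vals, max_vals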

Perf results on MobileNetV2 during training, measured with the autograd profiler (a sketch of the profiling setup follows the table):

    Model                        Self CPU time total    CUDA time total
    FP32 forward call            47.222ms               124.001ms
    FakeQuant model (before)     19.107s                27.177s
    FakeQuant model (after)      404.667ms              446.344ms

Test Plan:
python test/test_quantization.py

Imported from OSS

Differential Revision: D20287841

fbshipit-source-id: 6b706b8206e0d0da3c3c217b014e8da5b71b870d
2020-03-05 18:29:41 -08:00
__init__.py Ignore F401 in all __init__.py without putting noqa (#25823) 2019-10-23 15:28:13 -07:00
_quantize_script.py [quant][graphmode][refactor] Better API for fold_convbn (#32380) 2020-01-24 15:46:47 -08:00
default_mappings.py [quant] Add Quantized BatchNorm2d module (#33109) 2020-02-13 12:15:43 -08:00
fake_quantize.py Per channel quantization performance improvement (#33772) 2020-02-26 10:19:25 -08:00
fuse_modules.py Enable inplace relu fusion for training (#33105) 2020-02-14 12:15:58 -08:00
observer.py [quant] Speed up per-channel min-max observer (#34118) 2020-03-05 18:29:41 -08:00
qconfig.py Updates to quantization documentation (#30288) 2019-11-23 09:29:30 -08:00
quantize.py [quantization] FP16 dynamic quantized Linear 2020-01-27 15:45:32 -08:00
stubs.py Factored out the default mappings 2019-10-03 11:52:21 -07:00