pytorch/caffe2/quantization/server
Blaise Sanouillet dd95bf65b6 [caffe2/FC DNNLOWP] Shrink Y_int32_ vector capacity when appropriate
Summary:
The FullyConnectedDNNLowPOp::Y_int32_ vectors consume between 1GB and 2GB on one of FB's larger applications. By adding tracing I noticed that the number of elements in each instance oscillates wildly over time. Since a resize operation only ever grows the buffer backing a vector, the buffer stays at its peak size and memory is wasted. So as a simple optimization, I added code to right-size the buffer backing the vector whenever the number of elements falls below half the vector's capacity at that point; this doesn't affect the existing elements.

There is of course a memory/cpu tradeoff here - with the change we are doing more mallocs and frees. I added tracing to measure how many times we grow or shrink per second: it's about 100 per second on average, which is not a great deal.

Test Plan:
Memory growth impact: over 24 hours and after the startup period, the memory consumed by this code grows from 0.85GB to 1.20GB vs 0.95GB to 1.75GB in the baseline. [ source: https://fburl.com/scuba/heap_profiles/wm47kpfe ]
https://pxl.cl/1pHlJ

Reviewed By: jspark1105

Differential Revision: D24592098

fbshipit-source-id: 7892b35f24e42403653a74a1a9d06cbc7ee866b9
2020-10-29 11:19:45 -07:00
__init__.py remediation of S205607 2020-07-17 17:19:47 -07:00
activation_distribution_observer.cc Add histogram collection and weight prepacking utils (#33125) 2020-02-13 01:40:20 -08:00
activation_distribution_observer.h Add API to collect output_col_minmax_histogram 2020-08-05 12:33:10 -07:00
batch_matmul_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
batch_matmul_dnnlowp_op.cc [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
batch_matmul_dnnlowp_op.h
batch_permutation_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
batch_permutation_dnnlowp_op.cc
batch_permutation_dnnlowp_op.h
caffe2_dnnlowp_utils.cc add warning to dnnlowp fc if quantization kind is not min_max 2019-10-04 17:03:19 -07:00
caffe2_dnnlowp_utils.h
channel_shuffle_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
channel_shuffle_dnnlowp_op.cc batch size 0 support in ChannelShuffle DNNLOWP op (#26858) 2019-09-26 00:40:07 -07:00
channel_shuffle_dnnlowp_op.h
CMakeLists.txt Use Int8QuantParamsBlob to pass the scale and zeropoint params (#40494) 2020-06-24 10:20:16 -07:00
compute_equalization_scale_test.py [mlf][efficiency] modify equalization scale operator to return single output (#46449) 2020-10-16 01:22:37 -07:00
compute_equalization_scale.cc [mlf][efficiency] modify equalization scale operator to return single output (#46449) 2020-10-16 01:22:37 -07:00
compute_equalization_scale.h Add operator to compute the equalization scale (#45096) 2020-09-24 15:19:49 -07:00
concat_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
concat_dnnlowp_op.cc
concat_dnnlowp_op.h
conv_depthwise_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
conv_dnnlowp_acc16_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
conv_dnnlowp_acc16_op.cc amend D14778810 (#18902) 2019-04-09 22:08:54 -07:00
conv_dnnlowp_acc16_op.h fold col offset into bias; optimize A symmetric quant (#17026) 2019-04-03 22:52:54 -07:00
conv_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
conv_dnnlowp_op.cc [fbgemm] use new more general depthwise 3d conv interface (#42697) 2020-08-07 18:30:56 -07:00
conv_dnnlowp_op.h use fbgemm's 3d group conv fast path (#29085) 2019-11-05 00:58:49 -08:00
conv_groupwise_dnnlowp_acc16_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
conv_groupwise_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
conv_pool_dnnlowp_op_base.h
conv_relu_op.cc add ConvRelu schema (#18693) 2019-04-01 13:09:07 -07:00
conv_relu_op.h
dequantize_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
dequantize_dnnlowp_op.cc
dequantize_dnnlowp_op.h
dnnlowp_op.h Extend int8 FC op to take scale and zero point from input 2020-06-13 02:34:45 -07:00
dnnlowp_partition.cc Enable batch_size = 0 support in DNNLOWP Concat operator (#26849) 2019-09-26 00:03:40 -07:00
dnnlowp_partition.h
dnnlowp_test_utils.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
dnnlowp.cc Int8 PTQ ops for online training (#39818) 2020-06-12 11:41:30 -07:00
dnnlowp.h [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
dynamic_histogram_test.cc Disable openmp in static and dynamic histograms (#30072) 2019-11-19 00:32:46 -08:00
dynamic_histogram.cc Minor fix for quantizing the Ads complex model 2020-03-04 08:34:59 -08:00
dynamic_histogram.h Disable openmp in static and dynamic histograms (#30072) 2019-11-19 00:32:46 -08:00
elementwise_add_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
elementwise_add_dnnlowp_op.cc use avx2 for Add without broadcast and when inputs are uint8_t (#25098) 2019-08-23 18:20:22 -07:00
elementwise_dnnlowp_op.h
elementwise_linear_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
elementwise_linear_dnnlowp_op.cc
elementwise_linear_dnnlowp_op.h
elementwise_mul_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
elementwise_mul_dnnlowp_op.cc
elementwise_sum_benchmark.cc
elementwise_sum_dnnlowp_op_avx2.cc
elementwise_sum_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
elementwise_sum_dnnlowp_op.cc [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
elementwise_sum_relu_op.cc Adding Histogram Binning Calibration to DSNN and Adding Type Double to Caffe2 ParallelSumOp/SumReluOp 2020-09-28 15:21:31 -07:00
fb_fc_packed_op.cc [FakeLowp] Open source more c2 ops (#38878) 2020-05-21 19:10:04 -07:00
fb_fc_packed_op.h Fix PackedGemmMatrixFP16 repacking (#43320) 2020-08-21 10:58:18 -07:00
fbgemm_fp16_pack_op.cc Open source fbgemm fp16 pack op (#36791) 2020-04-17 21:00:52 -07:00
fbgemm_fp16_pack_op.h [Fakelowp] Open source fake fp16 FC ops (#37923) 2020-05-06 23:53:27 -07:00
fbgemm_pack_blob.h Fix missing header (#34762) 2020-03-15 00:19:42 -07:00
fbgemm_pack_matrix_cache.cc
fbgemm_pack_matrix_cache.h
fbgemm_pack_op.cc fix int8 FC (#42691) 2020-08-12 09:30:34 -07:00
fbgemm_pack_op.h fix int8 FC (#42691) 2020-08-12 09:30:34 -07:00
fc_fake_lowp_test.cc
fully_connected_dnnlowp_acc16_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
fully_connected_dnnlowp_acc16_op.cc
fully_connected_dnnlowp_acc16_op.h
fully_connected_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
fully_connected_dnnlowp_op.cc [caffe2/FC DNNLOWP] Shrink Y_int32_ vector capacity when appropriate 2020-10-29 11:19:45 -07:00
fully_connected_dnnlowp_op.h add Int8FCRelu (#18673) 2019-04-01 23:50:30 -07:00
fully_connected_fake_lowp_op_avx2.cc [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
fully_connected_fake_lowp_op.cc
fully_connected_fake_lowp_op.h
fully_connected_fp16_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
fully_connected_rowwise_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
gather_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
group_norm_dnnlowp_op_avx2.cc
group_norm_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
group_norm_dnnlowp_op.cc batch size 0 support in norm operators (#26894) 2019-09-26 16:08:35 -07:00
group_norm_dnnlowp_op.h
im2col_dnnlowp.h
int8_gen_quant_params_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
int8_gen_quant_params.cc Adjust bound_shape_inferencer to take 4 inputs for FCs (#41934) 2020-08-05 18:44:48 -07:00
int8_gen_quant_params.h Add serializer and deserializer for Int8QuantSchemeBlob and Int8QuantParamsBlob (#40661) 2020-07-02 17:17:05 -07:00
int8_quant_scheme_blob_fill_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
int8_quant_scheme_blob_fill.cc Adjust bound_shape_inferencer to take 4 inputs for FCs (#41934) 2020-08-05 18:44:48 -07:00
int8_quant_scheme_blob_fill.h Op to create quant scheme blob (#40760) 2020-07-11 10:53:10 -07:00
kl_minimization_example.cc
kl_minimization.cc
kl_minimization.h
l1_minimization_example.cc
l2_minimization_approx_example.cc
l2_minimization_example.cc
l2_minimization_test.cc
l2_minimization.h
lstm_unit_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
lstm_unit_dnnlowp_op.cc
lstm_unit_dnnlowp_op.h
mmio.h StoreMatrixInMatrixMarketFormat can store both integer and float tensors (#21606) 2019-06-11 17:28:19 -07:00
norm_minimization_avx2.cc
norm_minimization.cc [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
observer_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
op_wrapper.h
p99_example.cc
p99.cc add DNNLOWP static qparam choosing to pybind 2019-10-29 12:05:33 -07:00
pool_dnnlowp_op_avx2.cc Average Pooling 3D AVX2 Implementation (#26111) 2019-09-17 03:41:34 -07:00
pool_dnnlowp_op_avx2.h Average Pooling 3D AVX2 Implementation (#26111) 2019-09-17 03:41:34 -07:00
pool_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
pool_dnnlowp_op.cc Average Pooling 3D AVX2 Implementation (#26111) 2019-09-17 03:41:34 -07:00
pybind.cc add fake fp16 fusions to net transforms (#42927) 2020-08-14 13:30:27 -07:00
quantization_error_minimization.h Add P99 method with configurable thresholds 2019-09-27 15:53:20 -07:00
quantize_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
quantize_dnnlowp_op.cc [caffe2][dnnlowp] Remove openmp usage in quantize dnnlowp op 2020-10-20 19:33:56 -07:00
quantize_dnnlowp_op.h
README.md
relu_dnnlowp_op_avx2.cc
relu_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
relu_dnnlowp_op.cc
relu_dnnlowp_op.h
requantization_test.cc
resize_nearest_3d_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
resize_nearest_3d_dnnlowp_op.cc Migrate the cpu and gpu implementations of resize nearest 3D from vision to caffe2 2019-10-03 16:14:00 -07:00
resize_nearest_3d_dnnlowp_op.h Migrate the cpu and gpu implementations of resize nearest 3D from vision to caffe2 2019-10-03 16:14:00 -07:00
resize_nearest_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
resize_nearest_dnnlowp_op.cc
resize_nearest_dnnlowp_op.h
sigmoid_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
sigmoid_dnnlowp_op.cc
sigmoid_test.cc
sigmoid.cc
sigmoid.h
spatial_batch_norm_dnnlowp_op_avx2.cc add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
spatial_batch_norm_dnnlowp_op.cc batch size 0 support in norm operators (#26894) 2019-09-26 16:08:35 -07:00
spatial_batch_norm_dnnlowp_op.h add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_relu_op.cc add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
tanh_dnnlowp_op_test.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00
tanh_dnnlowp_op.cc
tanh_test.cc
tanh.cc
tanh.h
transpose.cc [pt][fbgemm] Turn on USE_FBGEMM on Windows env (#297) 2020-02-19 15:09:21 -08:00
transpose.h
utility_dnnlowp_ops.cc
utility_dnnlowp_ops.h
utils.py Remove __future__ imports for legacy Python2 supports (#45033) 2020-09-23 17:57:02 -07:00

This directory contains quantized Caffe2 operators optimized for Intel and AMD x86 processors, using the FBGEMM backend for matrix multiplications and convolutions. Caffe2's quantized resnet50 (https://github.com/caffe2/models/tree/master/resnet50_quantized) uses the operators implemented here. We call these DNNLOWP (deep neural network low-precision, an homage to Google's gemmlowp library that our basic quantization method is based on) operators. We need to explicitly set the engine to DNNLOWP to use these operators. Otherwise, Int8* operators will use the implementation in pytorch/caffe2/operators/quantized, which uses the QNNPACK backend whose primary optimization target is ARM mobile processors. Please use the Int8* op types; op types without the Int8 prefix combined with the DNNLOWP engine are deprecated.
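
As an illustration, here is a minimal Python sketch of selecting the DNNLOWP engine when building an operator with caffe2.python (the blob names Xq/Yq are hypothetical):

from caffe2.python import core

# Hypothetical Int8Relu running on the DNNLOWP (FBGEMM) backend.
# Without engine="DNNLOWP", an Int8* op would use the QNNPACK-based
# implementation in caffe2/operators/quantized instead.
relu_op = core.CreateOperator(
    "Int8Relu",
    ["Xq"],   # quantized (uint8) input blob
    ["Yq"],   # quantized (uint8) output blob
    engine="DNNLOWP",
)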

Quantization method

The basic quantization method is similar to that used by gemmlowp (https://github.com/google/gemmlowp/) and TensorFlow Lite. That is, we use a linear quantization method where quantization and dequantization are simple affine transformations (plus rounding and saturation during quantization), so quantization bins are uniform. Similar to gemmlowp, our quantized operators use asymmetric quantization by default, but there is an option to use symmetric quantization (this can be controlled globally using the caffe2_dnnlowp_preserve_activation_sparsity and caffe2_dnnlowp_preserve_weight_sparsity gflags options, or on a per-operator basis using the preserve_activation_sparsity and preserve_weight_sparsity arguments). Unsigned 8-bit integers are used for activations and signed 8-bit integers are used for weights (this design choice is mostly because the int8 SIMD instructions in x86 take one unsigned and one signed input operand). We also support per-output-channel quantization, similar to gemmlowp (Int8FC with the DNNLOWP_ROWWISE engine). Note that only the weights can have multiple pairs of scale and zero_offset (one per output channel); activations can still have only one pair of scale and zero_offset. This is because if an activation had per-channel quantization, the inner products of a GEMM that multiplies the activation would require summing numbers with different scales, which incurs significant overhead. We also support group-wise quantization (the quantize_groupwise argument of the Int8Conv operator).
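
As a rough sketch of the affine scheme described above (illustrative only, not the exact DNNLOWP code path), quantization and dequantization of an unsigned 8-bit activation look like:

import numpy as np

def quantize_uint8(x, scale, zero_point):
    # Affine quantization: uniform bins, rounding, then saturation to [0, 255].
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize_uint8(q, scale, zero_point):
    # Dequantization is the inverse affine transformation (lossy due to rounding).
    return scale * (q.astype(np.float32) - zero_point)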

To compute the quantization parameters of activation tensors, we need to know their value distributions (except when we use dynamic quantization, which is explained below). We have histogram_observer, which can be attached to Caffe2 nets to collect the distributions. By default, the quantization parameters are selected based on the min and max, but we highly recommend using the quantization method that minimizes the L2 norm of the quantization error, or its much faster approximate version (see norm_minimization.cc). The L2-minimizing quantization can be selected globally by setting the gflags caffe2_dnnlowp_activation_quantization_kind and caffe2_dnnlowp_weight_quantization_kind to L2 or L2_APPROX, or on a per-operator basis using the activation_quantization_kind and weight_quantization_kind arguments.
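
For reference, a simplified sketch of the default min/max parameter selection for an unsigned 8-bit activation (the L2 and L2_APPROX methods in norm_minimization.cc instead search over the collected histogram):

import numpy as np

def choose_qparams_min_max(x_min, x_max, precision=8):
    # Make sure zero is exactly representable in the quantized domain.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    qmax = (1 << precision) - 1
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    zero_point = int(np.clip(round(-x_min / scale), 0, qmax))
    return scale, zero_point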

Differences from gemmlowp and TensorFlow Lite

  • Floating-point requantization

Unlike gemmlowp, which uses fixed-point operations to emulate the floating-point arithmetic of requantization, fbgemm just uses single-precision floating-point operations, because on x86 single-precision floating-point operations are simply faster. gemmlowp presumably uses pure fixed-point operations to target low-end mobile processors. QNNPACK has similar constraints to gemmlowp and provides multiple requantization implementations. Users could modify the code to use a different requantization implementation, for example to be bit-wise identical to the HW they want to emulate. If there are enough requests, we could consider implementing a few popular fixed-point requantization schemes as QNNPACK did.
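
Conceptually, the single-precision requantization path looks like the following sketch (bias and column offsets omitted for brevity; this is an illustration, not fbgemm's actual kernel code):

import numpy as np

def requantize_fp32(acc_int32, in_scale, w_scale, out_scale, out_zero_point):
    # A single fp32 multiplier folds the input, weight, and output scales together.
    multiplier = np.float32(in_scale * w_scale / out_scale)
    q = np.round(acc_int32 * multiplier) + out_zero_point
    return np.clip(q, 0, 255).astype(np.uint8)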

  • 16-bit accumulation with outlier-aware quantization

On current Intel processors, int8 multiplication with int32 accumulation doesn't provide a very high speedup: 3 instructions (vpmaddubsw + vpmaddwd + vpadd) are needed. With 16-bit accumulation, we can use 2 instructions instead, with up to 2x instruction throughput per cycle. However, 16-bit accumulation can lead to frequent saturation and hence a big accuracy drop. We minimize saturation by splitting the weight matrix into two parts, W = W_main + W_outlier, where W_main contains the values with small magnitude and W_outlier contains the residual. The matrix multiplication X x W^T is calculated in two stages: X x W_main^T uses 16-bit accumulation, and X x W_outlier^T uses 32-bit accumulation. W_outlier is typically sparse, so X x W_outlier^T accounts for a small fraction of the total time. This implementation can be used by setting the Caffe2 operator engine to DNNLOWP_ACC16. Conv, ConvRelu, and FC support DNNLOWP_ACC16. The outlier threshold can be controlled with the nbits_in_non_outlier argument of the operator. For example, when nbits_in_non_outlier=7, a value is an outlier if it needs more than 7 bits (i.e., the value is <= -65 or >= 64).
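
The weight split follows the nbits_in_non_outlier rule described above; a numpy sketch (illustrative only):

import numpy as np

def split_outliers(W, nbits_in_non_outlier=7):
    # Values representable in nbits_in_non_outlier signed bits stay in W_main;
    # the residual goes into the (typically sparse) W_outlier matrix.
    lo = -(1 << (nbits_in_non_outlier - 1))      # e.g. -64 for 7 bits
    hi = (1 << (nbits_in_non_outlier - 1)) - 1   # e.g.  63 for 7 bits
    is_outlier = (W < lo) | (W > hi)
    W_main = np.where(is_outlier, 0, W)
    W_outlier = np.where(is_outlier, W, 0)
    return W_main, W_outlier  # W == W_main + W_outlier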

  • Dynamic quantization

DNNLOWP operators support dynamic quantization, which selects quantization parameters per mini-batch. This can be useful, for example, in neural machine translation models that spend most of their time in FC (while end-to-end quantization is challenging due to the diverse set of operator types) and whose batch sizes are small, so dynamic quantization does not add much overhead. The advantage of dynamic quantization is twofold: no need to collect the value distributions of activation tensors, and possibly higher accuracy. This option can be used by setting the dequantize_output operator argument.
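
For example, a hypothetical dynamically quantized FC built with caffe2.python (blob names are made up, and the exact input types may differ; dequantize_output makes the operator produce a floating-point output):

from caffe2.python import core

# Sketch: quantization parameters are chosen per mini-batch from the actual
# input, and the result is dequantized back to fp32.
fc_op = core.CreateOperator(
    "Int8FC",
    ["X", "W", "b"],
    ["Y"],
    engine="DNNLOWP",
    dequantize_output=1,
)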

Quantization operators

The following quantized operators are currently implemented:

  • Int8BatchMatMul
  • Int8BatchPermutation
  • Int8ChannelShuffle
  • Int8Concat
  • Int8Conv and Int8ConvRelu : additionally available engine DNNLOWP_ACC16 (16-bit accumulation)
  • Int8Dequantize
  • Int8Add
  • Int8ElementwiseLinear
  • Int8Mul
  • Int8Sum and Int8SumRelu
  • Int8FC : additionally available engine DNNLOWP_ACC16 (16-bit accumulation) and DNNLOWP_ROWWISE
  • Int8GroupNorm
  • Int8LSTMUnit
  • Int8AveragePool
  • Int8MaxPool
  • Int8Relu
  • Int8ResizeNearest
  • Int8SpatialBN
  • Int8Tanh : uses lookup table
  • Int8Sigmoid : reuses the same lookup table for Tanh
  • Int8Gather

Differences from mobile quantization operators

The aim is for the Int8* operators in caffe2/operators/quantized (primarily optimized for mobile processors) to be compatible with the operators in this directory, but there are a few minor differences that we will fix soon:

  • The implementation of Int8AveragePool in this directory can have different quantization parameters for its input and output.
  • Int8Sum in caffe2/operators/quantized assumes 2 input tensors, while the implementation in this directory can work with an arbitrary number of input tensors.

Extra functionality

  • Propagation of quantization parameters

The quantized operators in this directory can have a followed_by argument such as "Relu", "Sigmoid", or "Tanh". Setting this can improve quantization accuracy, for example by narrowing the quantization parameters of an output tensor (e.g., saturating negative values to 0 if the operator is followed by Relu). Currently it is in fact mandatory to set followed_by when the next operator is Sigmoid or Tanh; this limitation will be fixed.
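
For instance, a sketch of passing followed_by when building an operator with caffe2.python (blob names and conv arguments are hypothetical):

from caffe2.python import core

# Hint that this conv feeds a Relu, so the negative output range can be
# dropped when choosing the output quantization parameters.
conv_op = core.CreateOperator(
    "Int8Conv",
    ["Xq", "Wq_packed"],
    ["Yq"],
    engine="DNNLOWP",
    kernel=2,
    order="NHWC",
    followed_by="Relu",
)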

  • Measure quantization error

To facilitate numerical debugging, setting the measure_quantization_error argument runs a shadow copy of the single-precision floating-point operator and reports the L2 error of the quantized outputs. This can help identify which operator introduces the biggest error, to narrow down numerical issues.
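
For example (illustrative only; blob names are hypothetical):

from caffe2.python import core

# Runs a shadow fp32 FC alongside the quantized one and reports the L2 error.
fc_op = core.CreateOperator(
    "Int8FC",
    ["Xq", "Wq", "bq"],
    ["Yq"],
    engine="DNNLOWP",
    measure_quantization_error=1,
)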

  • Different precision

Our operators also have some functionality to emulate different fixed-point HW implementations. For example, setting the activation_quantization_precision and weight_quantization_precision operator arguments to 4 emulates 4-bit HW. We can also emulate more than 8 bits; using that requires a different engine, DNNLOWP_16 (apologies for the poor choice of name, which is easy to confuse with DNNLOWP_ACC16).
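
For example, a hypothetical 4-bit emulation setup (blob names and values are for illustration only):

from caffe2.python import core

# Emulate 4-bit fixed-point HW; for more than 8 bits, use engine DNNLOWP_16.
fc_op = core.CreateOperator(
    "Int8FC",
    ["Xq", "Wq", "bq"],
    ["Yq"],
    engine="DNNLOWP",
    activation_quantization_precision=4,
    weight_quantization_precision=4,
)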

  • Pre-packing weights

FBGEMM needs the weight tensor pre-packed into its custom layout for high performance (note that this is not the case for activations, which can stay in the standard NHWC layout, because a custom activation layout would complicate interoperability with producer/consumer operators). We also need other pre-processing, like computing column offsets when input activations use asymmetric quantization, or separating out outliers as a sparse matrix for 16-bit accumulation. Doing this kind of pre-processing inside the Conv/FC operator would duplicate the results when a weight tensor is shared among multiple operators (which can happen if we run multiple copies of the same net sharing weights). fbgemm_pack_matrix_cache.h provides a quick workaround that caches the pre-processing results, but we strongly recommend using the pre-packing operators as shown in the following example.

init_net:

op {
  input: "Wq"
  input: "bq"
  output: "Wq_packed"
  name: "Pack_Wq"
  type: "Int8ConvPackWeight"
  engine: "DNNLOWP"
}
...

predict_net:

...
op {
  input: "Xq"
  input: "Wq_packed"
  output: "Yq"
  name: "Conv_example"
  type: "Int8Conv"
  arg {
    name: "kernel"
    i: 2
  }
  arg {
    name: "order"
    s: "NHWC"
  }
  arg {
    name: "Y_scale"
    f: 1.0
  }
  arg {
    name: "Y_zero_point"
    i: 0
  }
  engine: "DNNLOWP"
}