pytorch/caffe2/quantization/server
Haixin Liu 7f130c8494 Expose the quantized inputs and output of dynamic quantized int8 FC operator for debugging (#23566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23566

Currently, if we use dynamic quantization, we don't have access to the internally quantized inputs and output for debugging.

To make debugging easier, this diff adds a debug feature that exposes the quantized X, W, and Y when debug outputs are attached to the operator and the caffe2_dnnlowp_force_slow_path flag is set.

The quantized inputs and output are exposed as extra outputs.

An example Int8FC op with debug outputs appended looks like:
```
op {
  input: "X"
  input: "W"
  input: "b"
  output: "Y"
  output: "X_q"
  output: "W_q"
  output: "Y_q"
  name: ""
  type: "Int8FC"
  arg {
    name: "axis"
    i: 1
  }
  ...
}
```

Next, we need to expose the quantization parameters.

Reviewed By: jspark1105

Differential Revision: D16566753

fbshipit-source-id: acd855a172ee7993ddba8808f2af81b628ff9c02
2019-08-02 21:23:43 -07:00
__init__.py re-enable copy of python files, but be careful that the copy is only … (#14982) 2018-12-11 16:54:08 -08:00
activation_distribution_observer.cc Fix a dev mode bug in activation distribution observer (#19004) 2019-04-08 09:36:50 -07:00
activation_distribution_observer.h handle a rare case of histogram min is inf/nan (#18239) 2019-03-31 21:32:54 -07:00
batch_matmul_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
batch_matmul_dnnlowp_op.cc Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
batch_matmul_dnnlowp_op.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
batch_permutation_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
batch_permutation_dnnlowp_op.cc BaseType:: -> this-> (#13817) 2018-11-12 12:51:12 -08:00
batch_permutation_dnnlowp_op.h remove dependency to fp32 batch permutation op (#15723) 2019-01-04 07:56:05 -08:00
caffe2_dnnlowp_utils.cc Change dnnlowp log level from warning to v2 (#18576) 2019-03-29 09:29:25 -07:00
caffe2_dnnlowp_utils.h remove unused parameters from caffe2_dnnlowp_utils.cc (#14164) 2018-11-20 00:56:06 -08:00
channel_shuffle_dnnlowp_op_test.py add NCHW2NHWC and NHWC2NCHW in utils.py (#15588) 2018-12-28 17:34:50 -08:00
channel_shuffle_dnnlowp_op.cc unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
channel_shuffle_dnnlowp_op.h operators/quantized/server -> quantization/server (#13660) 2018-11-07 22:54:13 -08:00
CMakeLists.txt Improve performance of Int8SpatialBN (needed for DF4 quantization) (#19702) 2019-04-30 10:26:48 -07:00
concat_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
concat_dnnlowp_op.cc Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
concat_dnnlowp_op.h operators/quantized/server -> quantization/server (#13660) 2018-11-07 22:54:13 -08:00
conv_depthwise_dnnlowp_op_test.py Fix the depthwise 3x3x3 fast path criteria for the stride (#19692) 2019-04-24 21:35:27 -07:00
conv_dnnlowp_acc16_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
conv_dnnlowp_acc16_op.cc amend D14778810 (#18902) 2019-04-09 22:08:54 -07:00
conv_dnnlowp_acc16_op.h fold col offset into bias; optimize A symmetric quant (#17026) 2019-04-03 22:52:54 -07:00
conv_dnnlowp_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
conv_dnnlowp_op.cc include conv_op_impl.h from conv_dnnlowp_op.cc (#22458) 2019-07-02 15:09:34 -07:00
conv_dnnlowp_op.h Remove unneeded headers (#21393) 2019-06-06 14:23:54 -07:00
conv_groupwise_dnnlowp_acc16_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
conv_groupwise_dnnlowp_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
conv_pool_dnnlowp_op_base.h Replace the remaining usages of IntList in caffe2 to IntArrayRef 2019-03-21 16:34:38 -07:00
conv_relu_op.cc add ConvRelu schema (#18693) 2019-04-01 13:09:07 -07:00
conv_relu_op.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
dequantize_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
dequantize_dnnlowp_op.cc print warnings when DNNLOWP_16 or DNNLOWP_ROWWISE_16 engine is used (#17176) 2019-03-04 14:28:42 -08:00
dequantize_dnnlowp_op.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
dnnlowp_op.h Fix issue in quantization error measurement when followed by Relu (#21890) 2019-06-19 22:29:54 -07:00
dnnlowp_partition.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
dnnlowp_partition.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
dnnlowp_test_utils.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
dnnlowp.cc Minor bug fix in dnnlowp (#15841) 2019-01-09 17:18:30 -08:00
dnnlowp.h removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
dynamic_histogram_test.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
dynamic_histogram.cc handle dst_bin_width==0 case properly (#18240) 2019-03-20 17:11:25 -07:00
dynamic_histogram.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
elementwise_add_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
elementwise_add_dnnlowp_op.cc removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
elementwise_dnnlowp_op.h Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
elementwise_linear_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
elementwise_linear_dnnlowp_op.cc Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
elementwise_linear_dnnlowp_op.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
elementwise_mul_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
elementwise_mul_dnnlowp_op.cc removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
elementwise_sum_benchmark.cc optimize elementwise sum (#17456) 2019-02-27 10:12:41 -08:00
elementwise_sum_dnnlowp_op_avx2.cc optimize elementwise sum (#17456) 2019-02-27 10:12:41 -08:00
elementwise_sum_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
elementwise_sum_dnnlowp_op.cc minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
elementwise_sum_relu_op.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
fbgemm_pack_blob.h fix minor comment (#21576) 2019-06-21 22:23:53 -07:00
fbgemm_pack_matrix_cache.cc merge fully_connected_rowwise_dnnlowp_op into fully_connected_dnnlowp_op (#17105) 2019-02-15 09:50:11 -08:00
fbgemm_pack_matrix_cache.h merge fully_connected_rowwise_dnnlowp_op into fully_connected_dnnlowp_op (#17105) 2019-02-15 09:50:11 -08:00
fbgemm_pack_op.cc Fbgemm fp16 tensor support (#23101) 2019-07-19 17:08:03 -07:00
fbgemm_pack_op.h Fbgemm fp16 tensor support (#23101) 2019-07-19 17:08:03 -07:00
fc_fake_lowp_test.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
fully_connected_dnnlowp_acc16_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
fully_connected_dnnlowp_acc16_op.cc merge fully_connected_rowwise_dnnlowp_op into fully_connected_dnnlowp_op (#17105) 2019-02-15 09:50:11 -08:00
fully_connected_dnnlowp_acc16_op.h pre-pack operation of dnnlowp conv with 16-bit accumulation (#14881) 2018-12-10 01:08:21 -08:00
fully_connected_dnnlowp_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
fully_connected_dnnlowp_op.cc Expose the quantized inputs and output of dynamic quantized int8 FC operator for debugging (#23566) 2019-08-02 21:23:43 -07:00
fully_connected_dnnlowp_op.h add Int8FCRelu (#18673) 2019-04-01 23:50:30 -07:00
fully_connected_fake_lowp_op_avx2.cc eliminate FE_INVALID exceptions related to fp16 conversion (#20390) 2019-05-13 23:42:01 -07:00
fully_connected_fake_lowp_op.cc Tensor reinitialization codemod - 5/5 (#15884) 2019-01-10 16:32:26 -08:00
fully_connected_fake_lowp_op.h Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
fully_connected_fp16_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
fully_connected_rowwise_dnnlowp_op_test.py unit test with multiple op invocations (#19118) 2019-04-15 14:41:28 -07:00
gather_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
group_norm_dnnlowp_op_avx2.cc minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
group_norm_dnnlowp_op_test.py add NCHW2NHWC and NHWC2NCHW in utils.py (#15588) 2018-12-28 17:34:50 -08:00
group_norm_dnnlowp_op.cc Separate Moments from math and optimize it (#16175) 2019-01-20 08:53:25 -08:00
group_norm_dnnlowp_op.h minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
im2col_dnnlowp.h Separate elementwise level2 math functions (#16753) 2019-02-07 18:38:26 -08:00
kl_minimization_example.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
kl_minimization.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
kl_minimization.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
l1_minimization_example.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
l2_minimization_approx_example.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
l2_minimization_example.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
l2_minimization_test.cc fix overly restrictive assertion (#17939) 2019-03-12 18:18:49 -07:00
l2_minimization.h minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
lstm_unit_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
lstm_unit_dnnlowp_op.cc Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
lstm_unit_dnnlowp_op.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
mmio.h StoreMatrixInMatrixMarketFormat can store both integer and float tensors (#21606) 2019-06-11 17:28:19 -07:00
norm_minimization_avx2.cc minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
norm_minimization.cc fix overly restrictive assertion (#17939) 2019-03-12 18:18:49 -07:00
observer_test.py operators/quantized/server -> quantization/server (#13660) 2018-11-07 22:54:13 -08:00
op_wrapper.h Unify the usage of Dequantize (#15685) 2019-01-02 21:32:46 -08:00
p99_example.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
p99.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
pool_dnnlowp_op_avx2.cc SIMD version average pooling added (#22148) 2019-06-25 12:19:21 -07:00
pool_dnnlowp_op_avx2.h SIMD version average pooling added (#22148) 2019-06-25 12:19:21 -07:00
pool_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
pool_dnnlowp_op.cc SIMD version average pooling added (#22148) 2019-06-25 12:19:21 -07:00
pybind.cc Fix a dev mode bug in activation distribution observer (#19004) 2019-04-08 09:36:50 -07:00
quantization_error_minimization.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
quantize_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
quantize_dnnlowp_op.cc removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
quantize_dnnlowp_op.h clang-format (#14160) 2018-11-20 00:56:00 -08:00
README.md add pre-packing operation in README.md (#17151) 2019-02-14 22:46:47 -08:00
relu_dnnlowp_op_avx2.cc minimize code compiled with avx2 and header includes from them (#14313) 2018-11-26 11:09:21 -08:00
relu_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
relu_dnnlowp_op.cc Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
relu_dnnlowp_op.h Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
requantization_test.cc removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
resize_nearest_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
resize_nearest_dnnlowp_op.cc Add ResizeNearest DNNLOWP op (#13940) 2018-11-15 21:03:01 -08:00
resize_nearest_dnnlowp_op.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
sigmoid_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
sigmoid_dnnlowp_op.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
sigmoid_test.cc Unify the usage of Dequantize (#15685) 2019-01-02 21:32:46 -08:00
sigmoid.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
sigmoid.h use pragma once (#14163) 2018-11-20 00:56:04 -08:00
spatial_batch_norm_dnnlowp_op_avx2.cc add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_dnnlowp_op_test.py add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_dnnlowp_op.cc add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_dnnlowp_op.h add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
spatial_batch_norm_relu_op.cc add Int8SpatialBNRelu (#22014) 2019-06-20 23:23:04 -07:00
tanh_dnnlowp_op_test.py unit test with multiple omp threads (#14958) 2018-12-10 17:23:44 -08:00
tanh_dnnlowp_op.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
tanh_test.cc Unify the usage of Dequantize (#15685) 2019-01-02 21:32:46 -08:00
tanh.cc clang-format (#14160) 2018-11-20 00:56:00 -08:00
tanh.h removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
transpose.cc Change all namespace fbgemm2 in the new fbgemm2 to namespace fbgemm (#13740) 2018-11-08 19:59:12 -08:00
transpose.h Change all namespace fbgemm2 in the new fbgemm2 to namespace fbgemm (#13740) 2018-11-08 19:59:12 -08:00
utility_dnnlowp_ops.cc removing quantization utility functions moved to fbgemm (#14301) 2018-11-21 21:38:23 -08:00
utility_dnnlowp_ops.h Make it consistent for OperatorBase usage (#15908) 2019-01-11 19:32:58 -08:00
utils.py more general fusion logic (#22015) 2019-06-20 20:44:26 -07:00

This directory contains quantized Caffe2 operators optimized for Intel and AMD x86 processors, using the FBGEMM backend for matrix multiplications and convolutions. Caffe2's quantized resnet50 (https://github.com/caffe2/models/tree/master/resnet50_quantized) uses the operators implemented here. We call these DNNLOWP (deep neural network low precision) operators, an homage to Google's gemmlowp library, on which our basic quantization method is based. The engine must be explicitly set to DNNLOWP to use these operators; otherwise, Int8* operators fall back to the implementations in pytorch/caffe2/operators/quantized, which use the QNNPACK backend and primarily target ARM mobile processors. Please use the Int8* op types; the combination of op types without the Int8 prefix and the DNNLOWP engine is deprecated.
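For illustration, here is a minimal sketch (blob names are hypothetical, and only arguments already shown in this README's examples are used) of creating an Int8FC operator routed to the DNNLOWP engine via the caffe2.python helpers:

```
# A minimal sketch of selecting the DNNLOWP engine on an Int8* operator.
# Blob names are hypothetical; only illustrative arguments are shown.
from caffe2.python import core

int8_fc = core.CreateOperator(
    "Int8FC",                # use the Int8* op type ...
    ["X_q", "W_q", "b"],     # quantized input, quantized weight, bias
    ["Y_q"],
    engine="DNNLOWP",        # ... and route it to the x86/FBGEMM backend
    Y_scale=1.0,             # output quantization parameters
    Y_zero_point=0,
)
print(int8_fc)
```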

Quantization method

The basic quantization method is similar to that used by gemmlowp (https://github.com/google/gemmlowp/) and TensorFlow Lite. That is, we use a linear quantization method where quantization and dequantization are simple affine transformations (plus rounding and saturation for quantization), so quantization bins are uniform. Similar to gemmlowp, our quantized operators use asymmetric quantization by default, but there is an option to use symmetric quantization (this can be controlled globally using the caffe2_dnnlowp_preserve_activation_sparsity and caffe2_dnnlowp_preserve_weight_sparsity gflags options, or on a per-operator basis using the preserve_activation_sparsity and preserve_weight_sparsity arguments). Unsigned 8-bit integers are used for activations and signed 8-bit integers are used for weights (this design choice is mostly because the int8 SIMD instructions on x86 take one unsigned and one signed input operand). We also support per-output-channel quantization, similar to gemmlowp (Int8FC with the DNNLOWP_ROWWISE engine). Note that only the weights can have multiple scale/zero_offset pairs (one per output channel); activations can still have only one scale/zero_offset pair. This is because, if an activation had per-channel quantization, the inner products of a GEMM multiplying that activation would need to sum numbers with different scales, which is significant overhead. We also support group-wise quantization (the quantize_groupwise argument of the Int8Conv operator).
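As an illustration of this scheme (not the DNNLOWP implementation itself), a minimal numpy sketch of affine quantization and dequantization with uint8 activations:

```
# A minimal numpy sketch of uniform affine quantization, assuming uint8.
import numpy as np

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    # Affine transformation plus rounding and saturation.
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Dequantization is the inverse affine transformation.
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
scale, zero_point = 2.0 / 255, 128
print(dequantize(quantize(x, scale, zero_point), scale, zero_point))
```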

To compute the quantization parameters of activation tensors, we need to know their value distributions (except when we use dynamic quantization, which is explained below). We have histogram_observer, which can be attached to Caffe2 nets to collect the distributions. By default, the quantization parameters are selected based on the min and max, but we highly recommend using the quantization method that minimizes the L2 norm of the quantization error, or its much faster approximate version (see norm_minimization.cc). The L2-minimizing quantization can be selected globally by setting the gflags caffe2_dnnlowp_activation_quantization_kind and caffe2_dnnlowp_weight_quantization_kind to L2 or L2_APPROX, or on a per-operator basis using the activation_quantization_kind and weight_quantization_kind arguments.
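For reference, a minimal numpy sketch of the default min/max-based parameter selection (the L2 and L2_APPROX methods instead search for the scale that minimizes the quantization error over the collected histogram; see norm_minimization.cc):

```
# A minimal sketch of min/max-based selection of uint8 quantization parameters.
import numpy as np

def choose_qparams_min_max(x, qmin=0, qmax=255):
    # Include 0 in the range so that zero is exactly representable.
    xmin, xmax = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale)) if scale > 0 else 0
    return scale, zero_point

act = np.random.randn(1000).astype(np.float32)
print(choose_qparams_min_max(act))
```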

Differences from gemmlowp and TensorFlow Lite

  • Floating-point requantization

Unlike gemmlowp, which uses fixed-point operations to emulate the floating-point math of requantization, fbgemm simply uses single-precision floating-point operations, because on x86 this is faster. gemmlowp presumably uses pure fixed-point operations to target low-end mobile processors. QNNPACK has similar constraints to gemmlowp and provides multiple requantization implementations. Users could modify the code to use a different requantization implementation, for example to be bit-wise identical to the hardware they want to emulate. If there are enough requests, we could consider implementing a few popular fixed-point requantization schemes, as QNNPACK did.
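For intuition, a small numpy sketch of floating-point requantization, where the int32 accumulator is scaled back to uint8 with a single fp32 multiplier (variable names are illustrative):

```
# A minimal sketch of requantization with a single-precision float multiplier.
import numpy as np

def requantize_fp32(acc_i32, in_scale, w_scale, out_scale, out_zero_point):
    multiplier = np.float32(in_scale * w_scale / out_scale)
    q = np.round(acc_i32.astype(np.float32) * multiplier) + out_zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

acc = np.array([12345, -678, 0], dtype=np.int32)
print(requantize_fp32(acc, in_scale=0.02, w_scale=0.01, out_scale=0.05,
                      out_zero_point=128))
```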

  • 16-bit accumulation with outlier-aware quantization

On current Intel processors, int8 multiplication with int32 accumulation doesn't provide a very high speedup: 3 instructions (vpmaddubsw + vpmaddwd + vpaddd) are needed. With 16-bit accumulation, we can use 2 instructions instead, with up to 2x instruction throughput per cycle. However, 16-bit accumulation can lead to frequent saturation and hence a big accuracy drop. We minimize the saturation by splitting the weight matrix into two parts, W = W_main + W_outlier, where W_main contains values with small magnitude and W_outlier contains the residual. The matrix multiplication X x W^T is calculated in two stages: X x W_main^T uses 16-bit accumulation, and X x W_outlier^T uses 32-bit accumulation. W_outlier is typically sparse, hence X x W_outlier^T accounts for a small fraction of the total time. This implementation can be used by setting the Caffe2 operator engine to DNNLOWP_ACC16. Conv, ConvRelu, and FC support DNNLOWP_ACC16. The outlier threshold can be controlled by the nbits_in_non_outlier argument of the operator. For example, when nbits_in_non_outlier=7, a value is an outlier if it needs more than 7 bits (i.e., the value is <= -65 or >= 64).
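A small numpy sketch of the outlier split (illustrative only; FBGEMM keeps W_outlier in a sparse format):

```
# A minimal sketch of splitting W into W_main + W_outlier for DNNLOWP_ACC16.
import numpy as np

def split_outliers(W, nbits_in_non_outlier=7):
    lo = -(1 << (nbits_in_non_outlier - 1))      # -64 for 7 bits
    hi = (1 << (nbits_in_non_outlier - 1)) - 1   #  63 for 7 bits
    is_outlier = (W < lo) | (W > hi)             # i.e. <= -65 or >= 64
    W_main = np.where(is_outlier, 0, W)          # dense, 16-bit accumulation
    W_outlier = np.where(is_outlier, W, 0)       # sparse, 32-bit accumulation
    return W_main, W_outlier

W = np.array([[3, -100, 64], [-65, 10, 0]], dtype=np.int8)
print(split_outliers(W))
```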

  • Dynamic quantization

DNNLOWP operators support dynamic quantization, which selects quantization parameters per mini-batch. This can be useful, for example, in neural machine translation models, which spend most of their time in FC (while end-to-end quantization is challenging due to their diverse set of operator types) and have small batch sizes, so dynamic quantization does not add much overhead. The advantage of dynamic quantization is twofold: there is no need to collect the value distributions of activation tensors, and accuracy can be higher. This option is enabled by setting the dequantize_output operator argument.
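For example, a minimal sketch (blob names hypothetical) of requesting dynamic quantization on Int8FC: with dequantize_output set, the floating-point input is quantized per mini-batch inside the operator and the output is returned dequantized.

```
# A minimal sketch of dynamic quantization on Int8FC via dequantize_output.
from caffe2.python import core

dyn_fc = core.CreateOperator(
    "Int8FC",
    ["X", "W", "b"],       # floating-point blobs (hypothetical names)
    ["Y"],                 # floating-point output
    engine="DNNLOWP",
    dequantize_output=1,   # quantization parameters selected per mini-batch
)
print(dyn_fc)
```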

Quantization operators

The following quantized operators are currently implemented:

  • Int8BatchMatMul
  • Int8BatchPermutation
  • Int8ChannelShuffle
  • Int8Concat
  • Int8Conv and Int8ConvRelu: the DNNLOWP_ACC16 engine (16-bit accumulation) is additionally available
  • Int8Dequantize
  • Int8Add
  • Int8ElementwiseLinear
  • Int8Mul
  • Int8Sum and Int8SumRelu
  • Int8FC: the DNNLOWP_ACC16 (16-bit accumulation) and DNNLOWP_ROWWISE engines are additionally available
  • Int8GroupNorm
  • Int8LSTMUnit
  • Int8AveragePool
  • Int8MaxPool
  • Int8Relu
  • Int8ResizeNearest
  • Int8SpatialBN
  • Int8Tanh: uses a lookup table
  • Int8Sigmoid: reuses the Tanh lookup table
  • Int8Gather

Differences from mobile quantization operators

The aim is for the Int8* operators in caffe2/operators/quantized (primarily optimized for mobile processors) to be compatible with the operators in this directory, but there are a few minor differences we will fix soon:

  • The implementation of Int8AveragePool in this directory can have different quantization parameters for its input and output.
  • Int8Sum in caffe2/operators/quantized assumes 2 input tensors, while the implementation in this directory can work with an arbitrary number of input tensors.

Extra functionality

  • Propagation of quantization parameters

The quantized operators in this directory accept a followed_by argument with values like "Relu", "Sigmoid", and "Tanh". Setting it can improve quantization accuracy by narrowing the quantization range of an output tensor, for example saturating negative values to 0 when the operator is followed by Relu. Note that it is currently mandatory to set followed_by when the next operator is Sigmoid or Tanh; this limitation will be fixed.
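For example, a minimal sketch (blob names and argument values are illustrative) of setting followed_by on an Int8Conv:

```
# A minimal sketch of the followed_by hint on Int8Conv.
from caffe2.python import core

conv = core.CreateOperator(
    "Int8Conv",
    ["X_q", "W_q", "b"],
    ["Y_q"],
    engine="DNNLOWP",
    kernel=2,
    order="NHWC",
    followed_by="Relu",    # output range can be narrowed to non-negative values
    Y_scale=1.0,
    Y_zero_point=0,
)
print(conv)
```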

  • Measure quantization error

To facilitate numerical debugging, setting the measure_quantization_error argument runs a shadow copy of the single-precision floating-point operator and reports the L2 error of the quantized outputs relative to it. This can help identify which operator introduces the largest error when narrowing down numerical issues.

  • Different precision

Our operators also have some functionality for emulating different fixed-point hardware implementations. For example, setting the activation_quantization_precision and weight_quantization_precision operator arguments to 4 emulates 4-bit hardware. We can also emulate more than 8 bits, which requires a different engine, DNNLOWP_16 (apologies for the poor choice of name; it is easily confused with DNNLOWP_ACC16).
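For example, a minimal sketch (illustrative blob names; output quantization arguments omitted) of emulating 4-bit hardware through these arguments:

```
# A minimal sketch of emulating 4-bit quantization precision.
from caffe2.python import core

fc_4bit = core.CreateOperator(
    "Int8FC",
    ["X_q", "W_q", "b"],
    ["Y_q"],
    engine="DNNLOWP",
    activation_quantization_precision=4,
    weight_quantization_precision=4,
)
print(fc_4bit)
```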

  • Pre-packing weights

FBGEMM needs the weight tensor pre-packed into its custom layout for high performance (this is not the case for activations, which can stay in the standard NHWC layout, because a custom activation layout would complicate interoperability with producer/consumer operators). Other pre-processing steps are also needed, such as computing column offsets when input activations use asymmetric quantization, or separating out outliers as a sparse matrix for 16-bit accumulation. Storing the results of this kind of pre-processing inside each Conv/FC operator duplicates them when a weight tensor is shared among multiple operators (which can happen if we run multiple copies of the same net sharing weights). fbgemm_pack_matrix_cache.h provides a quick workaround by caching the pre-processing results, but we strongly recommend using the pre-packing operators as shown in the following example.

init_net:

```
op {
  input: "Wq"
  input: "bq"
  output: "Wq_packed"
  name: "Pack_Wq"
  type: "Int8ConvPackWeight"
  engine: "DNNLOWP"
}
...
```

predict_net:

```
...
op {
  input: "Xq"
  input: "Wq_packed"
  output: "Yq"
  name: "Conv_example"
  type: "Int8Conv"
  arg {
    name: "kernel"
    i: 2
  }
  arg {
    name: "order"
    s: "NHWC"
  }
  arg {
    name: "Y_scale"
    f: 1.0
  }
  arg {
    name: "Y_zero_point"
    i: 0
  }
  engine: "DNNLOWP"
}
```