Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23566
Currently, if we use dynamic quantization, we don't have access to the internally quantized inputs and output for debugging.
To make debugging easier, this diff adds a debug feature that exposes the quantized X, W, and Y when debug outputs are attached to the operator and the caffe2_dnnlowp_force_slow_path flag is set.
The quantized inputs and output are exposed as extra outputs.
An example Int8FC op with debug outputs appended looks like:
```
op {
input: "X"
input: "W"
input: "b"
output: "Y"
output: "X_q"
output: "W_q"
output: "Y_q"
name: ""
type: "Int8FC"
arg {
name: "axis"
i: 1
}
...
}
```
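A minimal sketch of inspecting the extra outputs from Python (assumes a net rewritten as above; the blob names and the GlobalInit flag syntax are illustrative, not the exact debugging workflow):
```
from caffe2.python import workspace

# Assumption: the debug outputs have been appended to the Int8FC op as shown above,
# and the slow path is forced so the quantized intermediates are materialized.
workspace.GlobalInit(["caffe2", "--caffe2_dnnlowp_force_slow_path=true"])
workspace.RunNet(predict_net)  # predict_net: the Int8 net with debug outputs attached

x_q = workspace.FetchBlob("X_q")  # internally quantized input
w_q = workspace.FetchBlob("W_q")  # internally quantized weight
y_q = workspace.FetchBlob("Y_q")  # quantized output before dequantization
```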
Next, we need to expose the quantization parameters.
Reviewed By: jspark1105
Differential Revision: D16566753
fbshipit-source-id: acd855a172ee7993ddba8808f2af81b628ff9c02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22143
Like the Conv DNNLOWP operator, allow FC to run the slow path to debug numerical issues caused by Intel's int8 instruction that horizontally adds two int8 multiplication results in 16 bits.
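As a minimal numeric sketch of the issue (plain numpy arithmetic, not the actual instruction): the vpmaddubsw-style path multiplies unsigned 8-bit activations by signed 8-bit weights and sums adjacent pairs with saturation into 16 bits, so the fast path can lose precision relative to 32-bit accumulation.
```
import numpy as np

# Two adjacent uint8 * int8 products that the 16-bit path sums with saturation.
a = np.array([255, 255], dtype=np.int32)  # uint8 activations
w = np.array([127, 127], dtype=np.int32)  # int8 weights
exact = int((a * w).sum())                # 64770 with 32-bit accumulation
saturated = min(exact, 2**15 - 1)         # 32767 after int16 saturation
print(exact, saturated)
```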
Reviewed By: hx89
Differential Revision: D15966885
fbshipit-source-id: c6726376a3e39d341fd8aeb0e54e0450d2af8920
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22015
The previous fusion logic only worked for operators that appear back-to-back in the linear order of the protobuf file.
This diff generalizes it to work for any predecessor-successor pair of operators in the graph, as long as there is no "interfering" use/def of the related blobs.
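A minimal sketch (hypothetical helper, not the actual implementation) of the "no interfering use/def" condition:
```
def can_fuse(net_proto, producer_idx, consumer_idx, blob):
    # Ops strictly between the producer and the consumer may neither read nor
    # write the blob that would be fused away.
    for op in net_proto.op[producer_idx + 1 : consumer_idx]:
        if blob in op.input or blob in op.output:
            return False  # interfering use/def found
    return True
```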
Reviewed By: csummersea
Differential Revision: D15916709
fbshipit-source-id: 82fe4911a8250845a8bea3427d1b77ce2442c495
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21606
StoreMatrixInMatrixMarketFormat could only dump quantized tensors, but sometimes we want to dump float tensors.
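For reference, the same Matrix Market text format can be produced for a dense float matrix with scipy (a sketch for inspecting such dumps, not the C++ code path):
```
import numpy as np
from scipy.io import mmwrite

W = np.random.randn(4, 3).astype(np.float32)
mmwrite("W_float.mtx", W)  # dense ("array") float matrix in Matrix Market format
```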
Reviewed By: csummersea
Differential Revision: D15741611
fbshipit-source-id: 95b03c2fdf1bd8407f7d925171d9dc9f25677464
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21393
Result of splitting the base diff: we moved a header from src/* to include/fbgemm/*.
Reviewed By: jianyuh
Differential Revision: D15635188
fbshipit-source-id: ad7d0ddba964ff1cb8b2e33f5f98e457a4d2eac9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390
duc0 Ngo implemented observing floating point exceptions, but there were a couple of places where we had "benign" floating point exceptions leading to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph on a partially uninitialized array in the remainder loop.
Reviewed By: hx89
Differential Revision: D15307358
fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19681
For the accelerator, we need to lower just the quantized weight data without layout transformation. This diff provides that option.
Reviewed By: jerryzh168, zrphercule
Differential Revision: D15066568
fbshipit-source-id: 133d749e087c2ad4a899bee5e96f597f70b2443c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19118
A bug introduced by D14700576 and reported by Yufei (fixed by D14778810 and D14785256) was not detected by our unit tests.
This diff improves the unit tests to catch such errors (with this diff and without D14778810, we can reproduce the bug Yufei reported).
This improvement also revealed a bug that affects accuracy when we pre-pack weight and bias together and the pre-packed weight/bias are used by multiple nets: we were modifying the pre-packed bias in place, even though it was supposed to be constant.
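A toy illustration (made-up numbers) of why in-place modification of a shared pre-packed bias breaks the second net that uses it:
```
import numpy as np

bias = np.array([10.0, 20.0])                 # shared pre-packed bias (should be constant)
col_offset_correction = np.array([1.0, 2.0])

bias += col_offset_correction  # net 1 folds the correction into the bias in place
bias += col_offset_correction  # net 2 re-uses the same blob and folds it again
print(bias)  # [12. 24.] instead of the expected [11. 22.]
```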
Reviewed By: csummersea
Differential Revision: D14806077
fbshipit-source-id: aa9049c74b6ea98d21fbd097de306447a662a46d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18902
The fix in D14778810 had an issue: when we fall back to acc32 because the outlier density is too high, W_quantized_ has already been modified. In this diff we first just count the number of outliers (without modifying W_quantized_), and only when the density is low enough and no fallback is needed do we modify W_quantized_ and construct an outlier matrix.
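A minimal numpy sketch (hypothetical threshold and density names) of the two-pass approach: count outliers first, and only split the weight into dense plus sparse outlier parts when no acc32 fallback is needed:
```
import numpy as np

def split_outliers(W_quantized, outlier_threshold=64, fallback_density=0.05):
    is_outlier = np.abs(W_quantized) > outlier_threshold
    if is_outlier.mean() > fallback_density:
        return W_quantized, None                      # fall back to acc32; W untouched
    W_dense = np.where(is_outlier, 0, W_quantized)    # outliers zeroed in the dense part
    W_outlier = np.where(is_outlier, W_quantized, 0)  # kept separately as a sparse matrix
    return W_dense, W_outlier
```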
Reviewed By: jspark1105
Differential Revision: D14785256
fbshipit-source-id: 03933110a4ca7409686a06b18a9bb921f8657950
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19004
Handle the exceptional case when the data has min 3.40282e+38 and max -3.40282e+38.
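3.40282e+38 is FLT_MAX, so this is what a min/max reduction looks like when it never saw any data. A sketch of one reasonable handling (an assumption, not the exact code):
```
import numpy as np

mini = np.finfo(np.float32).max    # +3.40282e+38: min never updated
maxi = -np.finfo(np.float32).max   # -3.40282e+38: max never updated
if mini > maxi:                    # nothing observed; use an empty range around zero
    mini, maxi = 0.0, 0.0
```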
Reviewed By: jspark1105
Differential Revision: D14822193
fbshipit-source-id: b9771d1584fdf8317f5b8c7f5806be5d27314386
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18974
When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.
Reviewed By: bddppq
Differential Revision: D14814067
fbshipit-source-id: aec917322de695e283f0aca1e930c5603d196404
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18878
When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.
TODO: add unit tests with better coverage
Reviewed By: feiyu1990
Differential Revision: D14778810
fbshipit-source-id: d49a8c4b7c815ab29b77feb53ee730ad63780488
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17026
D14013931 was for FC. This diff makes similar optimizations for Conv.
A subtle difference is that in FC, once we fold col_offset into the bias during the pre-processing step, we can treat everything as if A_zero_offset == 0 (symmetric quantization of A).
In Conv, we can't do this because padding still needs to use the original A_zero_offset.
From the requantization point of view, once col_offset is folded into the bias, we can act as if we're doing symmetric A quantization.
But for steps involving padding, like im2col, im2col fused with packing, and direct conv for depth-wise/group convolution, we still need to pass the original A_zero_offset.
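A toy numpy example (made-up values, not the fbgemm code) of why the FC-side folding works: the activation zero point only multiplies per-column sums of the weight, which can be pre-added to the bias.
```
import numpy as np

K = 4
A = np.random.randint(0, 256, (2, K)).astype(np.int32)     # quantized activations
B = np.random.randint(-128, 128, (K, 3)).astype(np.int32)  # quantized weights
a_z, b_z = 3, 2                                            # zero points
bias = np.array([100, 200, 300], dtype=np.int32)

exact = (A - a_z) @ (B - b_z) + bias

# Fold -a_z * col_sum(B - b_z) into the bias once, at pre-packing time; the hot loop
# then no longer needs a_z, as if A were symmetrically quantized.
bias_folded = bias - a_z * (B - b_z).sum(axis=0)
as_if_symmetric = A @ (B - b_z) + bias_folded

assert np.array_equal(exact, as_if_symmetric)
```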
Reviewed By: jianyuh
Differential Revision: D14020276
fbshipit-source-id: c29caefd1127bbc6aff0e9d535939bb0c1ecb66c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18672
On Skylake, when n < 128 or k < 128, acc16 is slower than acc32.
Reviewed By: jianyuh
Differential Revision: D14700576
fbshipit-source-id: 80ca9f1af4626637eed9c5ca49f95ae744811189
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18239
When min is inf or NaN, we get UBSAN errors.
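A sketch (an assumption about the shape of the fix, not the exact code) of sanitizing the range before it reaches the quantization-parameter computation:
```
import math

def sanitize_range(mini, maxi):
    # inf or NaN endpoints would otherwise flow into scale/zero-point arithmetic
    # and trigger undefined-behavior reports.
    if not (math.isfinite(mini) and math.isfinite(maxi)):
        return 0.0, 0.0
    return mini, maxi
```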
Reviewed By: csummersea
Differential Revision: D14537668
fbshipit-source-id: e70ffb5ecd2b10793356070c69fdabf8f25b203e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16373
Motivation: https://github.com/pytorch/pytorch/pull/12407
This is a manual diff.
Most of the fixes are of the form:
```
auto* Y = Output(0);
Y->Resize(dims);
Y->raw_mutable_data(dtype);
```
-->
```
auto* Y = Output(0, dims, at::dtype(dtype));
```
But there might be other cases.
Reviewed By: dzhulgakov
Differential Revision: D13725460
fbshipit-source-id: 649a4b0e42f62cda1a60171dd9fa3e440dc9dca1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18246
Simplifies the histogram collection and quantization process.
Before this diff, histogram collection looked something like this:
```
from caffe2.quantization.server import dnnlowp_pybind11
...
dnnlowp_pybind11.ObserveHistogramOfOutput(hist_file)
for ...
    workspace.RunNet(predict_net)
dnnlowp_pybind11.ClearNetObservers()  # This triggers the Stop function in the observer to dump out the histogram file, but it can have the unintended consequence of also clearing all the other useful observers we attached
```
After this diff, we can do:
```
workspace.CreateNet(predict_net)  # Note: we need to create the net so there is a net to attach the observer to
histogram_observer = dnnlowp_pybind11.AddHistogramObserver(predict_net, hist_file)
for ...
    workspace.RunNet(predict_net)
predict_net.RemoveObserver(histogram_observer)
```
Before this diff, choosing quantization parameters for weights looked something like this:
```
dnnlowp_pybind11.ObserveHistogramOfOutput(weight_hist_file)
workspace.RunNetOnce(init_net)
dnnlowp_pybind11.ClearNetObservers() # Has same issue as the histogram collection example above
dnnlowp_pybind11.RegisterQuantizationParamsWithHistogram(
    weight_hist_file, is_weight=True, qparams_output_file_name=qparams_file
)
workspace.CreateNet(init_net, overwrite=True)
dnnlowp_pybind11.ClearNetObservers()
logger.info("Loading quantization params from {}".format(qparams_file))
blobs_to_qparams = {}
with open(qparams_file) as f:
    lines = f.readlines()
    for line in lines:
        op_id, op_type, output_id, tensor_name, mini, maxi, scale, zero_point, precision = (
            line.split()
        )
        op_id = int(op_id)
        output_id = int(output_id)
        op = net.Proto().op[op_id]
        if op_type != op.type or op.output[output_id] != tensor_name:
            print(
                "Corrupt qparams file {} {} {} {} {}".format(
                    qparams_file, op_type, op.type, op.output[output_id], tensor_name
                )
            )
        blobs_to_qparams[tensor_name] = QuantizationParam(float(scale), int(zero_point))
```
After this diff, this can be simplified to:
```
blobs_to_qparams = {}
for op in init_net.Proto().op:
    for output in op.output:
        scale, zero_point = dnnlowp_pybind11.ChooseQuantizationParams(output)
        blobs_to_qparams[output] = QuantizationParam(scale, zero_point)
```
Reviewed By: dskhudia
Differential Revision: D14544694
fbshipit-source-id: 4fd06cd63256201e2e9d15c39f503138d1be53c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18240
For the rare case when dst_bin_width == 0, we should just put all numbers into an arbitrary bin.
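A minimal sketch (hypothetical names) of the degenerate remapping case: when the destination range collapses to a point, avoid the division by zero and send everything to one bin:
```
def remap_bin(src_value, dst_min, dst_bin_width, num_dst_bins):
    if dst_bin_width == 0:
        return 0  # put all numbers into one arbitrary bin
    idx = int((src_value - dst_min) / dst_bin_width)
    return min(max(idx, 0), num_dst_bins - 1)
```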
Reviewed By: csummersea
Differential Revision: D14544685
fbshipit-source-id: 02d04ff8bd1555d6cf7e7eeb1196a4ab3325a9e5
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64 that do not exist on 32-bit systems even when they have AVX2. This disables AVX2 when _mm256_extract_epi64 does not exist.
This fixes the "local" part of #17901 (except disabling FBGEMM), but sleef also needs to be updated and NNPACK needs to be fixed; see the bug report for further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915
Differential Revision: D14437338
Pulled By: soumith
fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17939
Instead of just asserting min <= 0 and max >= 0, we adjust the histogram to include 0 in the range.
We need to include 0 in the range during norm error minimization to correctly represent our quantization method, which includes 0.
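A one-line sketch of the adjustment (an assumption about its exact form): extend the observed range so it always contains 0 before the histogram is used for norm error minimization:
```
def extend_range_to_include_zero(mini, maxi):
    return min(mini, 0.0), max(maxi, 0.0)
```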
Reviewed By: csummersea
Differential Revision: D14428732
fbshipit-source-id: 6669a9d2c7d409ec3b31aee0afe48071986b9b71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17764
Original commit changeset: f1923fdca4a1
Reverting the int8 ops fixed the original runtime regression.
We'll ignore the memory regression since it is flaky; see D14228484.
Reviewed By: dzhulgakov
Differential Revision: D13885233
fbshipit-source-id: ccbe4b94acb44b7b4cb3ae4d73e3f6091e1e1195
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17456
Uses an instruction sequence similar to the one in fbgemm/src/QuantUtilAvx2.cc.
Adds elementwise_sum_benchmark.
Reviewed By: protonu
Differential Revision: D14205695
fbshipit-source-id: 84939c9d3551f123deec3baf7086c8d31fbc873e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17105
To make FC with rowwise quantization faster, reduce code duplication, and make code consistent with Convolution
Reviewed By: csummersea
Differential Revision: D14080461
fbshipit-source-id: 2b0e67b86e7e3029c90751a8824bf80ae1223680
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17145
The prepacked weight contains both the weight and the bias, so the bias should be obtained from input index 1, not from index 2.
Reviewed By: jianyuh
Differential Revision: D14097281
fbshipit-source-id: b8b836b85a7b240e2fd1734377c46d9bf2ce3390