Commit Graph

101 Commits

Summer Deng
cbd0a2d3c9 Fix the depthwise 3x3x3 fast path criteria for the stride (#19692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19692

Remove the requirement on stride for the optimized depthwise 3x3x3 kernels.

Reviewed By: jspark1105

Differential Revision: D15070214

fbshipit-source-id: 9fe2d8e96930166e4eb0e2dd2288f6a0c4831e0a
2019-04-24 21:35:27 -07:00
Jongsoo Park
ffc9e29844 unit test with multiple op invocations (#19118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19118

A bug introduced by D14700576, reported by Yufei (fixed by D14778810 and D14785256), was not detected by our unit tests.
This diff improves the unit tests to catch such errors (with this diff and without D14778810, we can reproduce the bug Yufei reported).
This improvement also revealed a bug that affects accuracy when we pre-pack weight and bias together and the pre-packed weight/bias are used by multiple nets: we were modifying the pre-packed bias in place even though it was supposed to be constant.
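
A minimal sketch of the in-place mutation bug described above (hypothetical names, not the actual caffe2 code):

```
#include <memory>
#include <vector>

struct PackedWeightBias {
  std::vector<int> packed_w;
  std::vector<float> bias;  // shared by every net that uses this packing
};

// Buggy pattern: folds a per-run offset into the shared bias in place, so a
// second net (or a second run) sees an already-modified "constant".
void setup_buggy(std::shared_ptr<PackedWeightBias> p, float offset) {
  for (auto& b : p->bias) b += offset;  // mutates shared state
}

// Fixed pattern: adjust a private copy and leave the pre-packed data intact.
std::vector<float> setup_fixed(const PackedWeightBias& p, float offset) {
  std::vector<float> local = p.bias;  // copy, then adjust
  for (auto& b : local) b += offset;
  return local;
}
```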

Reviewed By: csummersea

Differential Revision: D14806077

fbshipit-source-id: aa9049c74b6ea98d21fbd097de306447a662a46d
2019-04-15 14:41:28 -07:00
Summer Deng
496b0b03d9 amend D14778810 (#18902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18902

The fix in D14778810 had an issue: when we fall back to acc32 because the outlier density is too high, W_quantized_ was already modified. In this diff we first just count the number of outliers (without modifying W_quantized_), and only when the density is low enough and no fallback is needed do we modify W_quantized_ and construct an outlier matrix.
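
A hedged sketch of the two-pass scheme (hypothetical names): count outliers first without touching W_quantized_, and only split the weights once we know no acc32 fallback is needed:

```
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <utility>
#include <vector>

// Pass 1: count only; W stays untouched, so falling back to acc32 can still
// use the original quantized weights.
int count_outliers(const std::vector<std::int8_t>& W, int threshold) {
  int n = 0;
  for (auto w : W)
    if (std::abs(static_cast<int>(w)) > threshold) ++n;
  return n;
}

// Pass 2, run only when the outlier density is low enough to keep the acc16
// path: strip outliers out of W in place and collect them into a sparse
// list for the outlier-aware kernel.
std::vector<std::pair<std::size_t, std::int8_t>>
extract_outliers(std::vector<std::int8_t>& W, int threshold) {
  std::vector<std::pair<std::size_t, std::int8_t>> outliers;
  for (std::size_t i = 0; i < W.size(); ++i) {
    if (std::abs(static_cast<int>(W[i])) > threshold) {
      outliers.emplace_back(i, W[i]);
      W[i] = 0;
    }
  }
  return outliers;
}
```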

Reviewed By: jspark1105

Differential Revision: D14785256

fbshipit-source-id: 03933110a4ca7409686a06b18a9bb921f8657950
2019-04-09 22:08:54 -07:00
Summer Deng
02968398d5 Fix a dev mode bug in activation distribution observer (#19004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19004

Handle the exceptional case when the data has min 3.40282e+38 and max -3.40282e+38.
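
3.40282e+38 is FLT_MAX, so a min of +FLT_MAX with a max of -FLT_MAX presumably means the running min/max were never updated because no data was observed. A minimal guard under that assumption:

```
// If min > max, the running trackers never saw a value (they still sit at
// their +FLT_MAX / -FLT_MAX initializers); collapse to an empty range
// instead of building a histogram over an inverted interval.
void sanitize_observed_range(float& mn, float& mx) {
  if (mn > mx) {
    mn = 0.0f;
    mx = 0.0f;
  }
}
```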

Reviewed By: jspark1105

Differential Revision: D14822193

fbshipit-source-id: b9771d1584fdf8317f5b8c7f5806be5d27314386
2019-04-08 09:36:50 -07:00
Summer Deng
907b4c5890 fix bug when falling back to acc32 when weight is prepacked (#18974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18974

When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.

Reviewed By: bddppq

Differential Revision: D14814067

fbshipit-source-id: aec917322de695e283f0aca1e930c5603d196404
2019-04-06 21:53:08 -07:00
Junjie Bai
46fe266507 Revert D14778810: [caffe2/int8] fix bug when falling back to acc32 when weight is prepacked
Differential Revision:
D14778810

Original commit changeset: d49a8c4b7c81

fbshipit-source-id: 15568b084848de74437582548bec42aadc74080d
2019-04-05 14:01:33 -07:00
Summer Deng
28990f34d9 fix bug when falling back to acc32 when weight is prepacked (#18881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18878

When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.

TODO: add unit tests with better coverage
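
A hedged sketch of the guard described above (hypothetical flags, not the actual operator code):

```
// Without this guard, a weight blob that was prepacked only for acc16 would
// send the op down an acc32 path that has no packed matrix to run with.
bool should_fallback_to_acc32(bool weight_is_prepacked,
                              bool prepacked_has_acc32,
                              bool outlier_density_too_high) {
  if (!outlier_density_too_high) return false;        // no reason to fall back
  return !weight_is_prepacked || prepacked_has_acc32; // fall back only if possible
}
```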

Reviewed By: feiyu1990

Differential Revision: D14778810

fbshipit-source-id: d49a8c4b7c815ab29b77feb53ee730ad63780488
2019-04-05 13:00:26 -07:00
Jongsoo Park
fa0ad057f8 fold col offset into bias; optimize A symmetric quant (#17026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17026

D14013931 was for FC. This diff makes similar optimizations for Conv.
A subtle difference is that in FC, once we fold col_offset into bias during the pre-processing step, we can treat everything as if A_zero_offset == 0 (symmetric quantization of A).
In Conv, we can't fully do this because padding still needs to use the original A_zero_offset.
From the requantization point of view, once col_offset is folded into bias, we can act as if we're doing symmetric A quantization.
But for steps involving padding, like im2col, im2col fused with packing, and direct conv for depth-wise/group convolution, we still need to pass the original A_zero_offset.
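
For reference, the algebra behind the folding (standard affine-quantization expansion, not copied from the diff), writing z_A, z_B for the zero points and K for the inner dimension:

```
C_{ij} = \sum_k (A_{ik} - z_A)(B_{kj} - z_B)
       = \sum_k A_{ik} B_{kj}
         - z_A \underbrace{\sum_k B_{kj}}_{\text{col\_offset}_j}
         - z_B \sum_k A_{ik}
         + K \, z_A z_B
```

When z_A is a fixed constant (static quantization), the z_A * col_offset_j term is data-independent and can be precomputed into the bias, after which requantization proceeds as if z_A = 0; padding is the exception, because the zeros it injects still have to be interpreted relative to the true z_A.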

Reviewed By: jianyuh

Differential Revision: D14020276

fbshipit-source-id: c29caefd1127bbc6aff0e9d535939bb0c1ecb66c
2019-04-03 22:52:54 -07:00
Jongsoo Park
06b7fe59f2 use optimization in D14020675 (#16945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16945

As title

Reviewed By: jianyuh

Differential Revision: D14020769

fbshipit-source-id: fc0f05fcc57bfe9b4aa0c5750060d7b2ba57dd7a
2019-04-03 08:05:10 -07:00
Jongsoo Park
f084c129db add Int8FCRelu (#18673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18673

Add a fused FC + Relu

Reviewed By: csummersea

Differential Revision: D14667055

fbshipit-source-id: d88fefba008fc0ca450291532d2b320694c6b785
2019-04-01 23:50:30 -07:00
Junjie Bai
246f5c412e Revert "Tensor construction codemod(raw_mutable_data) (#16373)" (#18680)
Summary:
This reverts commit d73c830e23.

We have observed a significant perf drop when training ResNext101 with multiple AMD GPUs:

Before:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-bench/1636/console
2-GPU ResNext training got 150~160 imgs/sec
4-GPU ResNext training got 270~280 imgs/sec

After:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-bench/1637/console
Both 2- and 4-GPU ResNext training drop to 110~120 imgs/sec

Similar perf drops are seen on ResNet50 training jobs as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18680

Differential Revision: D14702941

Pulled By: bddppq

fbshipit-source-id: 828141805afc23f25c08d4a2eb6d4b99f817c128
2019-04-01 14:39:13 -07:00
Jongsoo Park
89e9b1cf8e add ConvRelu schema (#18693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18693

As title

Reviewed By: protonu

Differential Revision: D14662880

fbshipit-source-id: 3664faa660a04e1f528a413d2a1700b872c3c684
2019-04-01 13:09:07 -07:00
Jongsoo Park
822c8ee143 use acc16 only when n>128 and k>128 in Skylake (#18672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18672

On Skylake, when n < 128 or k < 128, acc16 is slower.
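
A hedged sketch of such a shape-based dispatch (thresholds from the summary; names hypothetical):

```
// acc16 kernels accumulate in int16 and periodically spill into int32; that
// bookkeeping only pays off on large enough GEMMs, and per the measurements
// above, small n or k on Skylake makes acc16 a net loss.
bool use_acc16(bool is_skylake, int n, int k) {
  if (is_skylake && (n < 128 || k < 128)) return false;
  return true;
}
```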

Reviewed By: jianyuh

Differential Revision: D14700576

fbshipit-source-id: 80ca9f1af4626637eed9c5ca49f95ae744811189
2019-04-01 08:52:28 -07:00
Jongsoo Park
505d50ea90 handle a rare case of histogram min is inf/nan (#18239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18239

When min is inf or nan, we get UBSAN errors.
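
A minimal guard in the spirit of the fix (assumed shape, not the actual diff):

```
#include <cmath>

// A histogram min of inf or NaN poisons every downstream computation
// (bin widths, float->int bin indices), which is what UBSAN flags; reset
// the range to empty instead.
void sanitize_histogram_range(float& mn, float& mx) {
  if (!std::isfinite(mn) || !std::isfinite(mx)) {
    mn = 0.0f;
    mx = 0.0f;
  }
}
```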

Reviewed By: csummersea

Differential Revision: D14537668

fbshipit-source-id: e70ffb5ecd2b10793356070c69fdabf8f25b203e
2019-03-31 21:32:54 -07:00
Jerry Zhang
d73c830e23 Tensor construction codemod(raw_mutable_data) (#16373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16373

motivation: https://github.com/pytorch/pytorch/pull/12407
This is a manual diff.
Most of the fixes should be:

```
auto* Y = Output(0);
Y->Resize(dims);
Y->raw_mutable_data(dtype);
```
-->
```
auto* Y = Output(0, dims, at::dtype(dtype));
```
But there might be other cases.

Reviewed By: dzhulgakov

Differential Revision: D13725460

fbshipit-source-id: 649a4b0e42f62cda1a60171dd9fa3e440dc9dca1
2019-03-29 18:36:46 -07:00
Summer Deng
7c438c82eb Change dnnlowp log level from warning to v2 (#18576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18576

As in title

Reviewed By: feiyu1990

Differential Revision: D14670898

fbshipit-source-id: 1983099b2ba57daab393278553f10dcdb1812fdf
2019-03-29 09:29:25 -07:00
Summer Deng
c297f26843 Add more options to the quantization model exporter (#18383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18383

Add command line options for different quantization schemes.

Reviewed By: amylittleyang

Differential Revision: D14476862

fbshipit-source-id: 37fbf5b4c1c550121eae313f5a71d703a0a87f0f
2019-03-25 04:23:17 -07:00
Jianyu Huang
18a6781f57 Fix alignment issues for Fake BFP16 fp32 -> bfp16 rounding routines (#18321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18321

As title.

Reviewed By: jspark1105

Differential Revision: D14575512

fbshipit-source-id: 0e33cdab54b1aef8b67f0b4c366692c5dbdf631d
2019-03-22 12:41:58 -07:00
Jongsoo Park
77a7285764 add more Python interface functions to make quantization simpler (#18246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18246

Simplifies the histogram collection and quantization process.

Histogram collection before this diff was something like this
```
from caffe2.quantization.server import dnnlowp_pybind11
...
dnnlowp_pybind11.ObserveHistogramOfOutput(hist_file)
for ...
   workspace.RunNet(predict_net)
dnnlowp_pybind11.ClearNetObservers()  # This triggers the observer's Stop function to dump the histogram file, but it can have the unintended consequence of also clearing all the other useful observers we attached
```

After this diff we can
```
workspace.CreateNet(predict_net)  # Note: we need to create the net so there is a net to attach the observer to
histogram_observer = dnnlowp_pybind11.AddHistogramObserver(predict_net, hist_file)
for ...
   workspace.RunNet(predict_net)
predict_net.RemoveObserver(histogram_observer)
```

Choosing quantization parameters of weights before this diff was something like this
```
dnnlowp_pybind11.ObserveHistogramOfOutput(weight_hist_file)
workspace.RunNetOnce(init_net)
dnnlowp_pybind11.ClearNetObservers() # Has the same issue as the histogram collection example above

dnnlowp_pybind11.RegisterQuantizationParamsWithHistogram(
    weight_hist_file, is_weight=True, qparams_output_file_name=qparams_file
)
workspace.CreateNet(init_net, overwrite=True)
dnnlowp_pybind11.ClearNetObservers()

logger.info("Loading quantization params from {}".format(qparams_file))
blobs_to_qparams = {}
with open(qparams_file) as f:
    lines = f.readlines()
for line in lines:
    op_id, op_type, output_id, tensor_name, mini, maxi, scale, zero_point, precision = (
        line.split()
    )
    op_id = int(op_id)
    output_id = int(output_id)
    op = net.Proto().op[op_id]
    if op_type != op.type or op.output[output_id] != tensor_name:
        print(
            "Corrupt qparams file {} {} {} {} {}".format(
                qparams_file, op_type, op.type, op.output[output_id], tensor_name
            )
        )
    blobs_to_qparams[tensor_name] = QuantizationParam(float(scale), int(zero_point))

```

After this diff this can be simplified to
```
blobs_to_qparams = {}
for op in init_net.Proto().op:
    for output in op.output:
        scale, zero_point = dnnlowp_pybind11.ChooseQuantizationParams(output)
        blobs_to_qparams[output] = QuantizationParam(scale, zero_point)
```

Reviewed By: dskhudia

Differential Revision: D14544694

fbshipit-source-id: 4fd06cd63256201e2e9d15c39f503138d1be53c2
2019-03-22 00:52:24 -07:00
Junjie Bai
46439c78d0 Replace the remaining usages of IntList in caffe2 to IntArrayRef
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18282

Differential Revision: D14569269

Pulled By: bddppq

fbshipit-source-id: 5fc33701b83f9efdec4b456d2691764831d10e7f
2019-03-21 16:34:38 -07:00
Jongsoo Park
bbbabda4e8 handle dst_bin_width==0 case properly (#18240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18240

For the rare case when dst_bin_width == 0, we should just put all numbers into an arbitrary bin.
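
A hedged sketch of the degenerate case (hypothetical names): when every observed value is identical, dst_min == dst_max and the destination bin width collapses to 0, so the whole mass goes into one arbitrary bin instead of dividing by zero:

```
#include <vector>

std::vector<double> remap_histogram(const std::vector<double>& src_hist,
                                    double total_count,
                                    double dst_bin_width,
                                    int dst_nbins) {
  std::vector<double> dst(dst_nbins, 0.0);
  if (dst_bin_width == 0.0) {
    dst[0] = total_count;  // all numbers land in one (arbitrary) bin
    return dst;
  }
  // ... normal proportional remapping of src_hist elided ...
  (void)src_hist;
  return dst;
}
```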

Reviewed By: csummersea

Differential Revision: D14544685

fbshipit-source-id: 02d04ff8bd1555d6cf7e7eeb1196a4ab3325a9e5
2019-03-20 17:11:25 -07:00
Jongsoo Park
87b6cbb6fd fix bug in pool_dnnlowp_op_avx2.cc (#18141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18141

VLEN should've been 32
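
For context (an inference from the file being an int8 AVX2 pooling kernel, not quoted from the diff):

```
// A 256-bit AVX2 register holds 256 / 8 = 32 uint8 elements, so the vector
// length for an 8-bit pooling kernel is 32; a VLEN of 16 would only cover
// half of each vector load.
constexpr int VLEN = 256 / 8;  // == 32
```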

Reviewed By: jianyuh

Differential Revision: D14510780

fbshipit-source-id: ddf12746e1c69677a268432432ddb088cc210084
2019-03-18 16:31:42 -07:00
Thomas Viehmann
13bc002422 fixes for AVX detection (#17915)
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64
that do not exist on 32-bit systems even when they have AVX2.
This disables AVX2 when _mm256_extract_epi64 does not exist.

This fixes the "local" part of #17901 (except for disabling FBGEMM),
but there is also sleef to be updated and NNPACK to be fixed;
see the bug report for further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915
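
A compile-time probe in the spirit of the fix (a sketch, not the actual cmake check): if this translation unit fails to build, the AVX2 routines are disabled.

```
#include <immintrin.h>

// _mm256_extract_epi64 only exists when targeting 64-bit x86; on a 32-bit
// build this fails to compile even with -mavx2, which is exactly the signal
// used to turn the AVX2 routines off.
int main() {
  __m256i v = _mm256_set1_epi64x(1);
  return static_cast<int>(_mm256_extract_epi64(v, 0));
}
```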

Differential Revision: D14437338

Pulled By: soumith

fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
2019-03-13 03:55:06 -07:00
Jongsoo Park
92e35ac0a7 fix overly restrictive assertion (#17939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17939

Instead of just asserting min <= 0 and max >= 0, we adjust the histogram to include 0 in the range.
We need to include 0 in the range during norm error minimization to correctly represent our quantization method, which includes 0.
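
A hedged sketch of the range adjustment (assumed shape of the code):

```
#include <algorithm>

// The quantization method always represents 0 exactly (the zero point must
// be a valid quantized value), so before norm-error minimization the
// histogram range is widened to contain 0 rather than asserted to.
void include_zero_in_range(float& mn, float& mx) {
  mn = std::min(mn, 0.0f);
  mx = std::max(mx, 0.0f);
}
```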

Reviewed By: csummersea

Differential Revision: D14428732

fbshipit-source-id: 6669a9d2c7d409ec3b31aee0afe48071986b9b71
2019-03-12 18:18:49 -07:00
Summer Deng
c10c73f047 Int8 FC performance debugging (#17700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17700

Add performance debugging utilities to the DNNLOWP FC operator and the Python script.

Reviewed By: amylittleyang

Differential Revision: D14321299

fbshipit-source-id: 50dbd7b352a1da5d2ecb659d8003e71e70750063
2019-03-08 19:03:54 -08:00
Jerry Zhang
ac87488bd3 Change ConvPoolOp<Context>::SetOutputSize to ConvPoolOp<Context>::GetOutputSize (#17764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17764

Original commit changeset: f1923fdca4a1

Reverting the int8 ops fixes the original runtime regression.
We'll ignore the memory regression since it is flaky; see D14228484.

Reviewed By: dzhulgakov

Differential Revision: D13885233

fbshipit-source-id: ccbe4b94acb44b7b4cb3ae4d73e3f6091e1e1195
2019-03-07 18:38:53 -08:00
Jongsoo Park
aea8dd8377 print warnings when DNNLOWP_16 or DNNLOWP_ROWWISE_16 engine is used (#17176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17176

As title

Reviewed By: csummersea

Differential Revision: D14111616

fbshipit-source-id: 1282cb2452c4ad385fd2dc6d3f8c19e9fec715ff
2019-03-04 14:28:42 -08:00
Jongsoo Park
222a07863f optimize elementwise sum (#17456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17456

Uses an instruction sequence similar to the function in fbgemm/src/QuantUtilAvx2.cc.
Added elementwise_sum_benchmark.

Reviewed By: protonu

Differential Revision: D14205695

fbshipit-source-id: 84939c9d3551f123deec3baf7086c8d31fbc873e
2019-02-27 10:12:41 -08:00
Jongsoo Park
08fed51926 optimize max pool 2d (#17418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17418

Retry of D14181620, this time with the CMakeLists.txt changes.

Reviewed By: jianyuh

Differential Revision: D14190538

fbshipit-source-id: c59b1bd474edf6376f4c2767a797b041a2ddf742
2019-02-22 19:43:57 -08:00
Lu Fang
0c24f3754b Revert D14181620: [caffe2/int8] optimize max pool 2d
Differential Revision:
D14181620

Original commit changeset: ffc6c4412bd1

fbshipit-source-id: 4391703164a672c9a8daecb24a46578765df67c6
2019-02-22 11:23:59 -08:00
Jongsoo Park
4778a4089e optimize max pool 2d (#17391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17391

Optimize 2D max pool using AVX2 intrinsics.

Reviewed By: jianyuh

Differential Revision: D14181620

fbshipit-source-id: ffc6c4412bd1c1d7839fe06226921df40d9cab83
2019-02-22 10:36:19 -08:00
Jongsoo Park
dad0dbd3b9 merge fully_connected_rowwise_dnnlowp_op into fully_connected_dnnlowp_op (#17105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17105

To make FC with rowwise quantization faster, reduce code duplication, and make the code consistent with Convolution.

Reviewed By: csummersea

Differential Revision: D14080461

fbshipit-source-id: 2b0e67b86e7e3029c90751a8824bf80ae1223680
2019-02-15 09:50:11 -08:00
Jongsoo Park
90fc6133b2 bug fix when we prepack weight and bias together (#17145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17145

The prepacked weight contains both the weight and the bias, so the bias should be obtained from input index 1, not from 2.

Reviewed By: jianyuh

Differential Revision: D14097281

fbshipit-source-id: b8b836b85a7b240e2fd1734377c46d9bf2ce3390
2019-02-15 09:21:20 -08:00
Jongsoo Park
0a975d333f add pre-packing operation in README.md (#17151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17151

As title

Reviewed By: jianyuh

Differential Revision: D14084272

fbshipit-source-id: e58c041e0374f6e82b337e5b6325ef06981ad8b4
2019-02-14 22:46:47 -08:00
Summer Deng
a1f2ed008f Minor fix of the histogram observer in FBL eval flows (#17118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17118

Fix a bug in the quantization eval workflow; add a mul_nets option to the histogram observer pybind.

Reviewed By: yinghai

Differential Revision: D14085321

fbshipit-source-id: 08e3153148522ebc9512a57144d9a8ad154bb6f8
2019-02-14 22:02:04 -08:00
Jongsoo Park
92221ad840 Fold col offsets into bias; optimize A symmetric quant (#16942)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16942

We can fold col offsets into bias if the zero point of the activation is constant.
fbgemm still needs to provide an option to pass col offsets in case the zero point of the activation keeps changing (e.g., dynamic quantization).
A trick to optimize the static quantization case is setting the A zero point to 0 after folding it into bias.

This diff also optimizes the case when weights use symmetric quantization. When the B zero point is 0, we use PackAMatrix instead of PackAWithRowOffset.

TODO:
Ideally, PackAWithRowOffset should perform as fast as PackAMatrix when B_zero_point is 0, to make client code simpler.
Same for PackAWithIm2Col and depth-wise convolution (group convolution is already doing this).
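
A hedged sketch of the dispatch mentioned above (the enum and helper are illustrative stand-ins for choosing between fbgemm's PackAMatrix and PackAWithRowOffset):

```
#include <cstdint>

enum class APacking { PlainMatrix, WithRowOffset };

// In the expansion of (A - z_A)(B - z_B), the row sums of A are only needed
// to cancel the -z_B * rowsum(A) term; with symmetric weights (z_B == 0)
// that term vanishes, so the cheaper packing without row offsets suffices.
APacking choose_a_packing(std::int32_t B_zero_point) {
  return B_zero_point == 0 ? APacking::PlainMatrix     // -> PackAMatrix
                           : APacking::WithRowOffset;  // -> PackAWithRowOffset
}
```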

Reviewed By: csummersea

Differential Revision: D14013931

fbshipit-source-id: e4d313343e2a16a451eb910beed30e35de02a40c
2019-02-12 17:33:06 -08:00
Summer Deng
b5111918cd Activation histogram net observer with multiple histogram files as output (#16855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16855

Save the histogram of each net to a separate file

Reviewed By: jspark1105

Differential Revision: D13991610

fbshipit-source-id: a5be4e37a5e63567dcd7fdf99f451ee31bb350a5
2019-02-07 19:51:30 -08:00
Xiaomeng Yang
2db847b3a7 Separate elementwise level2 math functions (#16753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16753

Separate elementwise level2 math functions

i-am-not-moving-c2-to-c10

Reviewed By: houseroad

Differential Revision: D13954928

fbshipit-source-id: 1ca7a5d3da96e32510f502e5e4e79168854bee67
2019-02-07 18:38:26 -08:00
Jongsoo Park
8105aaca86 int8 SpatialBN (#16796)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16796

SpatialBN int8 version

Reviewed By: dskhudia

Differential Revision: D13971224

fbshipit-source-id: e55fd608c161069daaa4e62c618bc14b01f32cb7
2019-02-06 15:32:01 -08:00
Jongsoo Park
30ab1773f9 call istringstream clear after str (#16820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16820

Sometimes histogram parsing was not working correctly due to changes in D13633256.
We need to call istringstream's clear() after str().
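
The C++ detail behind the fix: istringstream::str(s) replaces the buffer but does not reset the stream's error flags, so a stream that hit EOF on the previous histogram line silently fails on the next one. A minimal illustration:

```
#include <iostream>
#include <sstream>

int main() {
  std::istringstream iss("1 2");
  int a = 0, b = 0;
  iss >> a >> b >> a;  // third extraction fails: eofbit and failbit are set
  iss.str("3 4");      // new buffer, but the error state persists
  iss.clear();         // without this, the next extraction silently fails
  iss >> a >> b;
  std::cout << a << " " << b << "\n";  // prints "3 4"
  return 0;
}
```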

Reviewed By: csummersea

Differential Revision: D13977509

fbshipit-source-id: ce3e8cb390641d8f0b5c9a7d6d6daadffeddbe11
2019-02-06 15:23:08 -08:00
Summer Deng
a7a2618d51 Bug fix in l2 quantization (#16749)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16749

Use global quantization options in l2 quantization

Reviewed By: jspark1105

Differential Revision: D13951378

fbshipit-source-id: d4e356149587e5d2d09a6937c7fa1aa131957fd6
2019-02-04 22:31:38 -08:00
Jerry Zhang
2af95d8e3e Back out "[pt1][tensor] Change ConvPoolOp<Context>::SetOutputSize to ConvPoolOp<Context>::GetOutputSize" (#16516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16516

Original commit changeset: 64abce3dbaed

Reviewed By: dzhulgakov

Differential Revision: D13863715

fbshipit-source-id: f1923fdca4a1a82768d9c280a8493ff15a7eb2ba
2019-01-30 12:50:38 -08:00
Jerry Zhang
ff963d4b9f Change ConvPoolOp<Context>::SetOutputSize to ConvPoolOp<Context>::GetOutputSize (#16273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16273

Previously we had SetOutputSize, which accepts a partially initialized output Tensor and sets it to the correct size;
this diff changes it to GetOutputSize, which returns the correct size instead.
E.g.:
```
auto* Y = Output(0);
ConvPoolOp<Context>::SetOutputSize(X, Y, channels);
...
Y->mutable_data<T>...
```
-->
```
auto sizes = ConvPoolOp<Context>::GetOutputSize(X, channels);
auto* Y = Output(0, sizes, at::dtype<T>());
```

Reviewed By: dzhulgakov

Differential Revision: D13736281

fbshipit-source-id: 64abce3dbaed0b375098463333dfd0ea5a3b1945
2019-01-28 15:56:34 -08:00
Jongsoo Park
1e19fd941f Fix formating in caffe2/quantization/server/README.md
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14237

Reviewed By: dskhudia

Differential Revision: D13751791

Pulled By: jspark1105

fbshipit-source-id: 54f73d5134e596817802c66d43098d18458c2799
2019-01-22 10:15:37 -08:00
Xiaomeng Yang
866c4e3467 Separate Moments from math and optimize it (#16175)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16175

Separate Moments from math and optimize it

i-am-not-moving-c2-to-c10

Reviewed By: houseroad

Differential Revision: D13742472

fbshipit-source-id: 90757d908d38c98ca69818855aaf68315e525992
2019-01-20 08:53:25 -08:00
Kjell Schubert
a28c0ff7b8 Allow for concurrent quantization in FullyConnectedDNNLowPOp (#16174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16174

Our service creates a new caffe2 workspace for the same underlying network on multiple threads concurrently at service startup time (later these workspaces are reused for sequential requests), resulting in concurrent quantization via FullyConnectedDNNLowPOp calling GetOrCreateFbgemmPackBMatrix(). The lazily performed quantizations during the first inference in each workspace are all funnelled through GetOrCreateFbgemmPackBMatrix()'s cache_mutex, which means quantization is serialized, so at service startup time only a single CPU core is used for around a minute until the serial quantization is done.
A better solution would be to avoid quantizing the same weight matrix in the operator copies across different net copies to begin with, but this is the simpler solution for our current problem.
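
A hedged sketch of one way to relieve that serialization (hypothetical names, not the actual caffe2 fix): do the expensive packing outside the cache lock and only lock to publish, accepting occasional duplicate work.

```
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct PackedMatrix { /* packed weight payload */ };

std::mutex cache_mutex;
std::unordered_map<std::string, std::shared_ptr<PackedMatrix>> cache;

// Pack outside the lock and only lock to publish: threads racing on the
// same key may each pack once (duplicate work), but a minute of packing no
// longer serializes every workspace behind one mutex.
std::shared_ptr<PackedMatrix> get_or_create(const std::string& key) {
  {
    std::lock_guard<std::mutex> g(cache_mutex);
    auto it = cache.find(key);
    if (it != cache.end()) return it->second;
  }
  auto packed = std::make_shared<PackedMatrix>();  // expensive pack, unlocked
  std::lock_guard<std::mutex> g(cache_mutex);
  // If another thread won the race, emplace keeps its entry and drops ours.
  return cache.emplace(key, std::move(packed)).first->second;
}
```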

Reviewed By: jspark1105

Differential Revision: D13708785

fbshipit-source-id: 537519896b3b939c552d67f400bafc8a69ce11eb
2019-01-19 06:00:22 -08:00
Jongsoo Park
964732fa8d use fbgemm gconv in dnnlowp (#16020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16020

Needs to go through more iterations. For conv, I think we need a high-level interface that abstracts out the low-level details of which code path will be taken (acc16, outlier-aware, depth-wise, group conv, ...); otherwise the client code gets complex, as can be seen from the DNNLOWP Conv ops. This will also help us make the interface more stable.

Reviewed By: dskhudia, jianyuh

Differential Revision: D13588996

fbshipit-source-id: 9afce9e441bcaf20437fcc2874fb9d4165a46bcb
2019-01-15 00:02:31 -08:00
Jongsoo Park
ca18fb8567 simplify lambda function use in conv dnnlowp ops to fix #15911 (#15996)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15996

As reported in issue #15911, gcc 4.9 was getting an internal compiler error due to a complex use of lambda functions in conv_dnnlowp_op.cc and conv_acc16_op.cc. This diff simplifies them.

Reviewed By: viswanathgs

Differential Revision: D13648264

fbshipit-source-id: 1551ae8a0a7653749185dca51ccceb2471b96b82
2019-01-13 23:32:48 -08:00
Jongsoo Park
04b8a2f1ba fix compile error reported in issue #15911 (#15953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15953

Fix issue reported in https://github.com/pytorch/pytorch/issues/15911

Reviewed By: csummersea

Differential Revision: D13633256

fbshipit-source-id: 3808f100ff7dedfe5e20708e72e6081ff07eb32c
2019-01-12 21:03:12 -08:00
Jongsoo Park
e5266b4ba6 3x3x3 depthwise convolution with per channel quantization (#15775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15775

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/55

fbgemm didn't have per-channel quantization for 3x3x3 depth-wise convolution
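
For illustration, the piece that per-channel quantization changes is requantization: each output channel carries its own weight scale. A minimal scalar sketch (hypothetical names, standard affine quantization):

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Per-channel quantization gives each output channel c its own weight scale,
// so the requantization multiplier varies per channel instead of being one
// scalar for the whole tensor.
std::int8_t requantize_per_channel(std::int32_t acc,   // int32 accumulator
                                   float act_scale,    // activation scale
                                   float w_scale_c,    // weight scale of channel c
                                   float out_scale,
                                   std::int32_t out_zero_point) {
  const float multiplier = act_scale * w_scale_c / out_scale;
  const long v = std::lround(acc * multiplier) + out_zero_point;
  return static_cast<std::int8_t>(std::min(127L, std::max(-128L, v)));
}
```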

Reviewed By: jianyuh

Differential Revision: D13587438

fbshipit-source-id: 91c36fae7a0e8386e3bc49808e18918b01681dd1
2019-01-11 19:42:29 -08:00