Commit Graph

118 Commits

Author SHA1 Message Date
Haixin Liu
7f130c8494 Expose the quantized inputs and output of dynamic quantized int8 FC operator for debugging (#23566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23566

Currently, if we use dynamic quantization, we don't have access to the internally quantized inputs and output for debugging.

To make debugging easier, this diff adds a debug feature that exposes the quantized X, W, and Y when debug outputs are attached to the operator and the caffe2_dnnlowp_force_slow_path flag is set.

The quantized inputs and output are exposed as extra outputs.

The example Int8FC op with debug outputs appended looks like:
```
op {
  input: "X"
  input: "W"
  input: "b"
  output: "Y"
  output: "X_q"
  output: "W_q"
  output: "Y_q"
  name: ""
  type: "Int8FC"
  arg {
    name: "axis"
    i: 1
  }
  ...
}
```

Next, we need to expose the quantization parameters.
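For illustration, a minimal Python sketch of building such an op with the debug outputs appended (this assumes a Caffe2 build with the DNNLOWP engine available; passing the flag through GlobalInit is an assumption about the flag plumbing, not code from the diff):
```
# Sketch only: not part of this diff.
from caffe2.python import core, workspace

# The gflag named above; GlobalInit is one way to set Caffe2 flags from Python.
workspace.GlobalInit(["caffe2", "--caffe2_dnnlowp_force_slow_path=1"])

# Append the quantized tensors as extra outputs after the regular output Y.
op = core.CreateOperator(
    "Int8FC",
    ["X", "W", "b"],
    ["Y", "X_q", "W_q", "Y_q"],  # debug outputs exposed as extra outputs
    axis=1,
    engine="DNNLOWP",
)
```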

Reviewed By: jspark1105

Differential Revision: D16566753

fbshipit-source-id: acd855a172ee7993ddba8808f2af81b628ff9c02
2019-08-02 21:23:43 -07:00
Yinghai Lu
b964bdb53a Fbgemm fp16 tensor support (#23101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23101

Support for
- Shape inference
- Tensor info extraction

Reviewed By: zrphercule

Differential Revision: D16345251

fbshipit-source-id: 53ef674b5b1581e6267e6d2070e34355280dae79
2019-07-19 17:08:03 -07:00
Jongsoo Park
738aba171b use caffe2_dnnlowp_force_slow_path in FC (#22143)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22143

Like the Conv DNNLOWP operator, allow FC to run the slow path to debug numerical issues caused by Intel's int8 instruction that does horizontal addition of two int8 multiplication results in 16 bits.
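As a small illustration of that numerical issue (not part of the diff): two uint8 x int8 products can exceed what a signed 16-bit accumulator holds, so the saturating horizontal add (presumably vpmaddubsw-style) silently clips the result.
```
import numpy as np

a = np.array([255, 255], dtype=np.int32)   # uint8 activations at their max
w = np.array([127, 127], dtype=np.int32)   # int8 weights at their max
exact = int(a @ w)                         # 64770
int16_saturated = min(max(exact, -32768), 32767)
print(exact, int16_saturated)              # 64770 vs 32767 -> silent accuracy loss
```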

Reviewed By: hx89

Differential Revision: D15966885

fbshipit-source-id: c6726376a3e39d341fd8aeb0e54e0450d2af8920
2019-07-08 17:01:04 -07:00
Jongsoo Park
040a4bd914 include conv_op_impl.h from conv_dnnlowp_op.cc (#22458)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22458

To ensure the templates get instantiated.

Reviewed By: jianyuh

Differential Revision: D16094183

fbshipit-source-id: 7861df0b303bec42ab80a53477c4b608edebb61d
2019-07-02 15:09:34 -07:00
Liuyi Jin
f5a1ea170b SIMD version average pooling added (#22148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22148

Adds a SIMD-optimized average pooling implementation to the dnnlowp code.

Reviewed By: jspark1105

Differential Revision: D15936556

fbshipit-source-id: 6177ee62529801898f230c6fb89e9c4b598593a5
2019-06-25 12:19:21 -07:00
Jongsoo Park
b19b20efef fix minor comment (#21576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21576

Fix comment regarding original_tensor

Reviewed By: jianyuh

Differential Revision: D15733294

fbshipit-source-id: e2957f32dcf90859b77e61c931b64abdd066aabb
2019-06-21 22:23:53 -07:00
Jongsoo Park
5d7cf66862 add Int8SpatialBNRelu (#22014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22014

Add Int8SpatialBN + Relu fused operator.

Reviewed By: dskhudia

Differential Revision: D15916551

fbshipit-source-id: a938e0f0e105ab5f823a3cb6144f50aa2ab944c1
2019-06-20 23:23:04 -07:00
Jongsoo Park
95aee81dd7 more general fusion logic (#22015)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22015

The previous fusion logic only works for operators that are back-to-back in the linear order of the protobuf file.
This diff generalizes it to work for any predecessor-successor pair of operators in the graph without any "interfering" use/def of the related blobs.
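A simplified sketch of that check, using a namedtuple as a stand-in for NetDef ops (names here are illustrative, not the actual implementation):
```
from collections import namedtuple

Op = namedtuple("Op", ["type", "input", "output"])

def can_fuse(ops, prod_idx, cons_idx, blob):
    """Producer at prod_idx writes `blob`; consumer at cons_idx reads it.
    Fusion is allowed only if no op in between uses or defines `blob`."""
    for op in ops[prod_idx + 1:cons_idx]:
        if blob in op.input or blob in op.output:
            return False  # interfering use/def of the intermediate blob
    return True

ops = [
    Op("Int8FC", ["X", "W", "b"], ["Y"]),
    Op("Int8Quantize", ["Z"], ["Z_q"]),  # unrelated op sits between the pair
    Op("Relu", ["Y"], ["Y_relu"]),
]
print(can_fuse(ops, 0, 2, "Y"))  # True: Int8FC and Relu can still be fused
```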

Reviewed By: csummersea

Differential Revision: D15916709

fbshipit-source-id: 82fe4911a8250845a8bea3427d1b77ce2442c495
2019-06-20 20:44:26 -07:00
Summer Deng
97ea44b34a Fix issue in quantization error measurement when followed by Relu (#21890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21890

As title

Reviewed By: jspark1105

Differential Revision: D15739808

fbshipit-source-id: 8fbcca04f0711fd9f994d67e1f4a604ef9fa42c6
2019-06-19 22:29:54 -07:00
Jongsoo Park
1ffa9d3d3b correct measure quantization error when followed_by=Relu and dequantize_output=1 (#21664)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21664

As title

Reviewed By: csummersea

Differential Revision: D15770947

fbshipit-source-id: 57f5842e1a250300703b02134c314e4f06b767b8
2019-06-11 23:36:15 -07:00
Jongsoo Park
afd202be9f StoreMatrixInMatrixMarketFormat can store both integer and float tensors (#21606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21606

StoreMatrixInMatrixMarketFormat could only dump quantized tensors, but sometimes we want to dump float tensors.
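For reference, the same kind of float dump can be sketched from Python with SciPy's MatrixMarket writer (this is only an analogue of the C++ helper, not the helper itself):
```
import numpy as np
from scipy.io import mmwrite

W_float = np.random.randn(4, 3).astype(np.float32)
# Writes a dense MatrixMarket "array" file that external tools can inspect.
mmwrite("W_float.mtx", W_float, comment="float weight dump for debugging")
```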

Reviewed By: csummersea

Differential Revision: D15741611

fbshipit-source-id: 95b03c2fdf1bd8407f7d925171d9dc9f25677464
2019-06-11 17:28:19 -07:00
Rui Zhu
2b902e9738 Fix the offset numerical bug when casting (#21484)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21484

cast<int32_t*> => cast<int32_t>

Also fixed a reserve problem that might cause an invalid pointer.

Reviewed By: yinghai

Differential Revision: D15699866

fbshipit-source-id: 374418476bddd60f5c5306c8c57319ccf28b9990
2019-06-07 12:33:18 -07:00
Daya Khudia
80a083ef92 Remove unneeded headers (#21393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21393

Result of splitting the base diff. We moved a header from src/* to include/fbgemm/*

Reviewed By: jianyuh

Differential Revision: D15635188

fbshipit-source-id: ad7d0ddba964ff1cb8b2e33f5f98e457a4d2eac9
2019-06-06 14:23:54 -07:00
Yinghai Lu
cf7ef5e631 Add onnxifi support for Int8FCDNNLowPPackedWeightBlob (#20564)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/20564

Reviewed By: bddppq

Differential Revision: D15106712

fbshipit-source-id: 428db9c23cfd36ddedc8d79121fbbb3bb484c993
2019-05-20 16:57:11 -07:00
Jongsoo Park
101176870e eliminate FE_INVALID exceptions related to fp16 conversion (#20390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390

duc0 Ngo implemented observing floating point exceptions, but there were a couple of places where we have "benign" floating point exceptions leading to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph on a partially uninitialized array in the remainder loop.

Reviewed By: hx89

Differential Revision: D15307358

fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
2019-05-13 23:42:01 -07:00
Yinghai Lu
56977db4a7 Provide option to save quantized data for DNNLOWP without layout optimization (#19681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19681

For accelerators, we need to lower just the quantized weight data without layout transformation. This diff provides that option.

Reviewed By: jerryzh168, zrphercule

Differential Revision: D15066568

fbshipit-source-id: 133d749e087c2ad4a899bee5e96f597f70b2443c
2019-04-30 12:32:42 -07:00
Daya S Khudia
d868c97580 Improve performance of Int8SpatialBN (needed for DF4 quantization) (#19702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19702

AVX2 implementation of the core computation for Int8SpatialBN.

Reviewed By: jianyuh

Differential Revision: D15073973

fbshipit-source-id: c30b0c621348ba9331ba5e48b281c00cf6e479a1
2019-04-30 10:26:48 -07:00
Summer Deng
cbd0a2d3c9 Fix the depthwise 3x3x3 fast path criteria for the stride (#19692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19692

Remove the requirement on stride for the optimized depthwise 3x3x3 kernels.

Reviewed By: jspark1105

Differential Revision: D15070214

fbshipit-source-id: 9fe2d8e96930166e4eb0e2dd2288f6a0c4831e0a
2019-04-24 21:35:27 -07:00
Jongsoo Park
ffc9e29844 unit test with multiple op invocations (#19118)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19118

A bug introduced by D14700576 and reported by Yufei (fixed by D14778810 and D14785256) was not detected by our unit tests.
This diff improves the unit tests to catch such errors (with this diff and without D14778810, we can reproduce the bug Yufei reported).
The improvement also revealed a bug that affects accuracy when we pre-pack weight and bias together and the pre-packed weight/bias are used by multiple nets: we were modifying the pre-packed bias in-place even though it was supposed to be constant.
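A toy reproduction of that bug class (numpy stand-in, not the actual operator): an op that folds an offset into a shared pre-packed bias in place drifts on the second invocation, which is exactly the pattern the improved tests exercise by invoking the op multiple times.
```
import numpy as np

def fc_buggy(x, w, packed_bias, col_offset):
    packed_bias -= col_offset                   # mutates the shared "constant"
    return x @ w + packed_bias

def fc_fixed(x, w, packed_bias, col_offset):
    return x @ w + (packed_bias - col_offset)   # leaves the constant untouched

x, w = np.ones((1, 2)), np.ones((2, 2))
bias, off = np.array([1.0, 1.0]), np.array([0.5, 0.5])
b = bias.copy()
print(fc_buggy(x, w, b, off), fc_buggy(x, w, b, off))        # results drift
print(fc_fixed(x, w, bias, off), fc_fixed(x, w, bias, off))  # stable across runs
```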

Reviewed By: csummersea

Differential Revision: D14806077

fbshipit-source-id: aa9049c74b6ea98d21fbd097de306447a662a46d
2019-04-15 14:41:28 -07:00
Summer Deng
496b0b03d9 amend D14778810 (#18902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18902

The fix in D14778810 had an issue: when we fall back to acc32 because the outlier density is too high, W_quantized_ has already been modified. In this diff we first just count the number of outliers (without modifying W_quantized_), and only when the density is low enough and no fallback is needed do we modify W_quantized_ and construct the outlier matrix.
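A rough numpy sketch of that two-pass flow (the threshold and density cutoff are made-up values, and W_q stands in for W_quantized_):
```
import numpy as np

def split_outliers(W_q, threshold=64, max_density=0.05):
    outlier_mask = np.abs(W_q) > threshold
    density = outlier_mask.mean()        # pass 1: count only, W_q untouched
    if density > max_density:
        return W_q, None                 # fall back to acc32 with W_q intact
    W_main = np.where(outlier_mask, 0, W_q)     # pass 2: only now split into
    W_outlier = np.where(outlier_mask, W_q, 0)  # dense main + sparse outliers
    return W_main, W_outlier
```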

Reviewed By: jspark1105

Differential Revision: D14785256

fbshipit-source-id: 03933110a4ca7409686a06b18a9bb921f8657950
2019-04-09 22:08:54 -07:00
Summer Deng
02968398d5 Fix a dev mode bug in activation distribution observer (#19004)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19004

Handle the exceptional case when the data has min 3.40282e+38 and max -3.40282e+38 (i.e., the observer's min/max are still at their +/-FLT_MAX initial values because no finite data was seen).
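A minimal sketch of that guard (the (0, 0) fallback is an assumption, not necessarily what the observer does):
```
import numpy as np

FLT_MAX = float(np.finfo(np.float32).max)   # 3.40282e+38

def observed_range(min_val, max_val):
    if min_val > max_val:        # still at the +FLT_MAX / -FLT_MAX sentinels
        return 0.0, 0.0          # assumed fallback: treat as a degenerate range
    return min_val, max_val

print(observed_range(FLT_MAX, -FLT_MAX))  # (0.0, 0.0)
```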

Reviewed By: jspark1105

Differential Revision: D14822193

fbshipit-source-id: b9771d1584fdf8317f5b8c7f5806be5d27314386
2019-04-08 09:36:50 -07:00
Summer Deng
907b4c5890 fix bug when falling back to acc32 when weight is prepacked (#18974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18974

When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.

Reviewed By: bddppq

Differential Revision: D14814067

fbshipit-source-id: aec917322de695e283f0aca1e930c5603d196404
2019-04-06 21:53:08 -07:00
Junjie Bai
46fe266507 Revert D14778810: [caffe2/int8] fix bug when falling back to acc32 when weight is prepacked
Differential Revision:
D14778810

Original commit changeset: d49a8c4b7c81

fbshipit-source-id: 15568b084848de74437582548bec42aadc74080d
2019-04-05 14:01:33 -07:00
Summer Deng
28990f34d9 fix bug when falling back to acc32 when weight is prepacked (#18881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/18878

When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.

TODO: add unit tests with better coverage

Reviewed By: feiyu1990

Differential Revision: D14778810

fbshipit-source-id: d49a8c4b7c815ab29b77feb53ee730ad63780488
2019-04-05 13:00:26 -07:00
Jongsoo Park
fa0ad057f8 fold col offset into bias; optimize A symmetric quant (#17026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17026

D14013931 was for FC; this diff applies similar optimizations to Conv.
A subtle difference is that in FC, once we fold col_offset into bias during the pre-processing step, we can treat everything as if A_zero_offset == 0 (symmetric quantization of A).
In Conv, we can't do this everywhere because padding still needs to use the original A_zero_offset.
From the requantization point of view, once col_offset is folded into bias, we can treat it as if we're doing symmetric A quantization.
But for steps involving padding, like im2col, im2col fused with packing, and direct conv for depth-wise/group convolution, we still need to pass the original A_zero_offset.
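A numpy check of the folding itself, in the FC-style case without padding (illustrative variable names; not code from the diff):
```
import numpy as np

rng = np.random.default_rng(0)
K = 8
A = rng.integers(0, 256, size=(2, K)).astype(np.int32)     # uint8-range activations
B = rng.integers(-127, 128, size=(K, 3)).astype(np.int32)  # int8-range weights
a_zp, b_zp = 3, 5
bias = rng.integers(-10, 10, size=3).astype(np.int32)

# Reference: fully asymmetric quantized matmul plus bias.
ref = (A - a_zp) @ (B - b_zp) + bias

# Fold the A-zero-point term (a_zp * col_offset) into the bias offline.
col_offset = (B - b_zp).sum(axis=0)
bias_folded = bias - a_zp * col_offset

# Runtime now only needs the "A symmetric" part: A @ (B - b_zp) + bias_folded.
opt = A @ (B - b_zp) + bias_folded
assert np.array_equal(ref, opt)
```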

Reviewed By: jianyuh

Differential Revision: D14020276

fbshipit-source-id: c29caefd1127bbc6aff0e9d535939bb0c1ecb66c
2019-04-03 22:52:54 -07:00
Jongsoo Park
06b7fe59f2 use optimization in D14020675 (#16945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16945

As title

Reviewed By: jianyuh

Differential Revision: D14020769

fbshipit-source-id: fc0f05fcc57bfe9b4aa0c5750060d7b2ba57dd7a
2019-04-03 08:05:10 -07:00
Jongsoo Park
f084c129db add Int8FCRelu (#18673)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18673

Add a fused FC + Relu

Reviewed By: csummersea

Differential Revision: D14667055

fbshipit-source-id: d88fefba008fc0ca450291532d2b320694c6b785
2019-04-01 23:50:30 -07:00
Junjie Bai
246f5c412e Revert "Tensor construction codemod(raw_mutable_data) (#16373)" (#18680)
Summary:
This reverts commit d73c830e23.

We have observed a significant perf drop when training ResNext101 with multiple AMD GPUs:

Before:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-bench/1636/console
2 GPUs ResNext training got 150~160 imgs/sec
4 GPUs ResNext training got 270~280 imgs/sec

After:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-clang7-rocmdeb-ubuntu16.04-bench/1637/console
Both 2 and 4 GPUs ResNext training drop to 110~120 imgs/sec

Similar perf drops are seen on ResNet50 training jobs as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18680

Differential Revision: D14702941

Pulled By: bddppq

fbshipit-source-id: 828141805afc23f25c08d4a2eb6d4b99f817c128
2019-04-01 14:39:13 -07:00
Jongsoo Park
89e9b1cf8e add ConvRelu schema (#18693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18693

As title

Reviewed By: protonu

Differential Revision: D14662880

fbshipit-source-id: 3664faa660a04e1f528a413d2a1700b872c3c684
2019-04-01 13:09:07 -07:00
Jongsoo Park
822c8ee143 use acc16 only when n>128 and k>128 in Skylake (#18672)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18672

In Skylake, when n < 128 or k < 128, acc16 is slower.

Reviewed By: jianyuh

Differential Revision: D14700576

fbshipit-source-id: 80ca9f1af4626637eed9c5ca49f95ae744811189
2019-04-01 08:52:28 -07:00
Jongsoo Park
505d50ea90 handle a rare case of histogram min is inf/nan (#18239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18239

When min is inf or nan, we get UBSAN errors.
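A sketch of the kind of guard this implies: filter out non-finite values before deriving the histogram range so the bin width stays finite (the 2048-bin default and names are assumptions):
```
import math

def histogram_range(values, nbins=2048):
    finite = [v for v in values if math.isfinite(v)]
    lo = min(finite, default=0.0)
    hi = max(finite, default=0.0)
    bin_width = (hi - lo) / nbins   # safe even if the raw inputs had inf/nan
    return lo, hi, bin_width
```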

Reviewed By: csummersea

Differential Revision: D14537668

fbshipit-source-id: e70ffb5ecd2b10793356070c69fdabf8f25b203e
2019-03-31 21:32:54 -07:00
Jerry Zhang
d73c830e23 Tensor construction codemod(raw_mutable_data) (#16373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16373

motivation: https://github.com/pytorch/pytorch/pull/12407
This is a manual diff.
most of the fixes should be:

```
auto* Y = Output(0);
Y->Resize(dims);
Y->raw_mutable_data(dtype);
```
-->
```
auto* Y = Output(0, dims, at::dtype(dtype));
```
But there might be other cases.

Reviewed By: dzhulgakov

Differential Revision: D13725460

fbshipit-source-id: 649a4b0e42f62cda1a60171dd9fa3e440dc9dca1
2019-03-29 18:36:46 -07:00
Summer Deng
7c438c82eb Change dnnlowp log level from warning to v2 (#18576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18576

As in title

Reviewed By: feiyu1990

Differential Revision: D14670898

fbshipit-source-id: 1983099b2ba57daab393278553f10dcdb1812fdf
2019-03-29 09:29:25 -07:00
Summer Deng
c297f26843 Add more options to the quantization model exporter (#18383)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18383

Add command line options for different quantization schemes.

Reviewed By: amylittleyang

Differential Revision: D14476862

fbshipit-source-id: 37fbf5b4c1c550121eae313f5a71d703a0a87f0f
2019-03-25 04:23:17 -07:00
Jianyu Huang
18a6781f57 Fix alignment issues for Fake BFP16 fp32 -> bfp16 rounding routines (#18321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18321

As title.

Reviewed By: jspark1105

Differential Revision: D14575512

fbshipit-source-id: 0e33cdab54b1aef8b67f0b4c366692c5dbdf631d
2019-03-22 12:41:58 -07:00
Jongsoo Park
77a7285764 add more Python interface functions to make quantization simpler (#18246)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18246

Simplifies the histogram collection and quantization process.

Histogram collection before this diff was something like this
```
from caffe2.quantization.server import dnnlowp_pybind11
...
dnnlowp_pybind11.ObserveHistogramOfOutput(hist_file)
for ...
   workspace.RunNet(predict_net)
dnnlowp_pybind11.ClearNetObservers()  # This triggers the observer's Stop function to dump the histogram file, but it can have the unintended consequence of also clearing all the other useful observers we attached
```

After this diff we can
```
workspace.CreateNet(predict_net)  # Note: we need to create the net so there is a net to attach the observer to
histogram_observer = dnnlowp_pybind11.AddHistogramObserver(predict_net, hist_file)
for ...
   workspace.RunNet(predict_net)
predict_net.RemoveObserver(histogram_observer)
```

Choosing quantization parameters of weights before this diff was something like this
```
dnnlowp_pybind11.ObserveHistogramOfOutput(weight_hist_file)
workspace.RunNetOnce(init_net)
dnnlowp_pybind11.ClearNetObservers() # Has same issue as the histogram collection example above

dnnlowp_pybind11.RegisterQuantizationParamsWithHistogram(
    weight_hist_file, is_weight=True, qparams_output_file_name=qparams_file
)
workspace.CreateNet(init_net, overwrite=True)
dnnlowp_pybind11.ClearNetObservers()

logger.info("Loading quantization params from {}".format(qparams_file))
blobs_to_qparams = {}
with open(qparams_file) as f:
    lines = f.readlines()
for line in lines:
    op_id, op_type, output_id, tensor_name, mini, maxi, scale, zero_point, precision = (
        line.split()
    )
    op_id = int(op_id)
    output_id = int(output_id)
    op = net.Proto().op[op_id]
    if op_type != op.type or op.output[output_id] != tensor_name:
        print(
            "Corrupt qparams file {} {} {} {} {}".format(
                qparams_file, op_type, op.type, op.output[output_id], tensor_name
            )
        )
    blobs_to_qparams[tensor_name] = QuantizationParam(float(scale), int(zero_point))

```

After this diff this can be simplified to
```
blobs_to_qparams = {}
for op in init_net.Proto().op:
    for output in op.output:
        scale, zero_point = dnnlowp_pybind11.ChooseQuantizationParams(output)
        blobs_to_qparams[output] = QuantizationParam(scale, zero_point)
```

Reviewed By: dskhudia

Differential Revision: D14544694

fbshipit-source-id: 4fd06cd63256201e2e9d15c39f503138d1be53c2
2019-03-22 00:52:24 -07:00
Junjie Bai
46439c78d0 Replace the remaining usages of IntList in caffe2 to IntArrayRef
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18282

Differential Revision: D14569269

Pulled By: bddppq

fbshipit-source-id: 5fc33701b83f9efdec4b456d2691764831d10e7f
2019-03-21 16:34:38 -07:00
Jongsoo Park
bbbabda4e8 handle dst_bin_width==0 case properly (#18240)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18240

For the rare cases when dst_bin_width == 0, we should just put all numbers into an arbitrary bin.
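A minimal sketch of that mapping (function and parameter names are illustrative):
```
def dst_bin_index(x, dst_min, dst_bin_width, nbins):
    if dst_bin_width == 0:
        return 0  # degenerate destination range: everything lands in one bin
    return min(int((x - dst_min) / dst_bin_width), nbins - 1)
```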

Reviewed By: csummersea

Differential Revision: D14544685

fbshipit-source-id: 02d04ff8bd1555d6cf7e7eeb1196a4ab3325a9e5
2019-03-20 17:11:25 -07:00
Jongsoo Park
87b6cbb6fd fix bug in pool_dnnlowp_op_avx2.cc (#18141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18141

VLEN should have been 32.

Reviewed By: jianyuh

Differential Revision: D14510780

fbshipit-source-id: ddf12746e1c69677a268432432ddb088cc210084
2019-03-18 16:31:42 -07:00
Thomas Viehmann
13bc002422 fixes for AVX detection (#17915)
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64
that do not exist on 32 bit systems even when they have AVX2.
This disables AVX2 when _mm256_extract_epi64 does not exist.

This fixes the "local" part of #17901 (except disabling FBGEMM),
but sleef still needs to be updated and NNPACK fixed;
see the bug report for further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915

Differential Revision: D14437338

Pulled By: soumith

fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
2019-03-13 03:55:06 -07:00
Jongsoo Park
92e35ac0a7 fix overly restrictive assertion (#17939)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17939

Instead of just asserting min <= 0 and max >= 0, we adjust the histogram to include 0 in the range.
We need to include 0 in the range during norm error minimization to correctly represent our quantization method, which includes 0.
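A one-line sketch of the adjustment (illustrative function name):
```
def widen_to_include_zero(hist_min, hist_max):
    return min(hist_min, 0.0), max(hist_max, 0.0)

print(widen_to_include_zero(0.5, 3.0))    # (0.0, 3.0)
print(widen_to_include_zero(-2.0, -0.5))  # (-2.0, 0.0)
```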

Reviewed By: csummersea

Differential Revision: D14428732

fbshipit-source-id: 6669a9d2c7d409ec3b31aee0afe48071986b9b71
2019-03-12 18:18:49 -07:00
Summer Deng
c10c73f047 Int8 FC performance debugging (#17700)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17700

Add performance debugging utilities to the DNNLOWP FC operator and the Python script.

Reviewed By: amylittleyang

Differential Revision: D14321299

fbshipit-source-id: 50dbd7b352a1da5d2ecb659d8003e71e70750063
2019-03-08 19:03:54 -08:00
Jerry Zhang
ac87488bd3 Change ConvPoolOp<Context>::SetOutputSize to ConvPoolOp<Context>::GetOutputSize (#17764)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17764

Original commit changeset: f1923fdca4a1

Reverting the int8 ops fixes the original runtime regression.
We'll ignore the memory regression since it is flaky; see D14228484.

Reviewed By: dzhulgakov

Differential Revision: D13885233

fbshipit-source-id: ccbe4b94acb44b7b4cb3ae4d73e3f6091e1e1195
2019-03-07 18:38:53 -08:00
Jongsoo Park
aea8dd8377 print warnings when DNNLOWP_16 or DNNLOWP_ROWWISE_16 engine is used (#17176)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17176

As title

Reviewed By: csummersea

Differential Revision: D14111616

fbshipit-source-id: 1282cb2452c4ad385fd2dc6d3f8c19e9fec715ff
2019-03-04 14:28:42 -08:00
Jongsoo Park
222a07863f optimize elementwise sum (#17456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17456

Uses an instruction sequence similar to the function in fbgemm/src/QuantUtilAvx2.cc.
Adds elementwise_sum_benchmark.

Reviewed By: protonu

Differential Revision: D14205695

fbshipit-source-id: 84939c9d3551f123deec3baf7086c8d31fbc873e
2019-02-27 10:12:41 -08:00
Jongsoo Park
08fed51926 optimize max pool 2d (#17418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17418

Retry of D14181620 this time with CMakeLists.txt changes

Reviewed By: jianyuh

Differential Revision: D14190538

fbshipit-source-id: c59b1bd474edf6376f4c2767a797b041a2ddf742
2019-02-22 19:43:57 -08:00
Lu Fang
0c24f3754b Revert D14181620: [caffe2/int8] optimize max pool 2d
Differential Revision:
D14181620

Original commit changeset: ffc6c4412bd1

fbshipit-source-id: 4391703164a672c9a8daecb24a46578765df67c6
2019-02-22 11:23:59 -08:00
Jongsoo Park
4778a4089e optimize max pool 2d (#17391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17391

Optimize 2D max pool using AVX2 intrinsics.

Reviewed By: jianyuh

Differential Revision: D14181620

fbshipit-source-id: ffc6c4412bd1c1d7839fe06226921df40d9cab83
2019-02-22 10:36:19 -08:00
Jongsoo Park
dad0dbd3b9 merge fully_connected_rowwise_dnnlowp_op into fully_connected_dnnlowp_op (#17105)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17105

To make FC with rowwise quantization faster, reduce code duplication, and make code consistent with Convolution

Reviewed By: csummersea

Differential Revision: D14080461

fbshipit-source-id: 2b0e67b86e7e3029c90751a8824bf80ae1223680
2019-02-15 09:50:11 -08:00
Jongsoo Park
90fc6133b2 bug fix when we prepack weight and bias together (#17145)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17145

The prepacked weight contains both weight and bias, so the bias should be obtained from input index 1, not index 2.

Reviewed By: jianyuh

Differential Revision: D14097281

fbshipit-source-id: b8b836b85a7b240e2fd1734377c46d9bf2ce3390
2019-02-15 09:21:20 -08:00