Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46457
Wanted to see if a CopyMatrix specialized for float that uses mkl_somatcopy could be faster, but it wasn't. Still want to check in the benchmark so it can be used later.
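For reference, a minimal sketch of the approach that was benchmarked, assuming an illustrative wrapper name (mkl_somatcopy is MKL's out-of-place single-precision matrix copy/transpose routine):
```
#include <mkl.h>  // mkl_somatcopy; requires linking against MKL
#include <vector>

// Hypothetical float-specialized CopyMatrix: 'R' = row-major ordering,
// 'N' = no transpose, alpha = 1.0f copies elements unscaled.
void CopyMatrixFloatMKL(int m, int n, const float* src, int ld_src,
                        float* dst, int ld_dst) {
  mkl_somatcopy('R', 'N', m, n, 1.0f, src, ld_src, dst, ld_dst);
}

int main() {
  const int m = 512, n = 512;
  std::vector<float> src(m * n, 1.0f), dst(m * n);
  CopyMatrixFloatMKL(m, n, src.data(), n, dst.data(), n);
  return 0;
}
```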
Test Plan: .
Reviewed By: dskhudia
Differential Revision: D24345901
fbshipit-source-id: d3e68dbb560e3138fda11c55789cd41bc0715c6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45551
The FP16 version of the SparseNormalize op in Caffe2 is missing. This diff adds FP16 support to unblock the MC process of adding FP16 to Dper3.
See https://fb.quip.com/L0T2AXGwUY3n#EReACAeifk3.
One open question is whether a pure FP16 SparseNormalize op will affect accuracy; perhaps we should do the computation in the FP32 domain instead.
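For illustration, a minimal sketch of the FP32-domain variant raised above, assuming c10::Half for storage (argument names and the max-norm condition are illustrative, not the real op's signature): the squared norm is accumulated and the scale applied in float, rounding back to FP16 only on the store.
```
#include <c10/util/Half.h>  // c10::Half converts implicitly to/from float
#include <cmath>
#include <cstdint>

// Sketch: L2-normalize the FP16 rows selected by `indices`, doing the
// arithmetic in FP32 to sidestep FP16 round-off in the norm accumulation.
void SparseNormalizeFp16(c10::Half* param, int64_t block_size,
                         const int64_t* indices, int64_t num_indices,
                         float norm, bool use_max_norm) {
  for (int64_t i = 0; i < num_indices; ++i) {
    c10::Half* row = param + indices[i] * block_size;
    float sum_sq = 0.0f;
    for (int64_t j = 0; j < block_size; ++j) {
      const float v = row[j];  // widen to FP32 before squaring
      sum_sq += v * v;
    }
    const float row_norm = std::sqrt(sum_sq);
    // max-norm mode only shrinks rows whose norm exceeds `norm`.
    if (row_norm > 1e-12f && (!use_max_norm || row_norm > norm)) {
      const float scale = norm / row_norm;
      for (int64_t j = 0; j < block_size; ++j) {
        row[j] = c10::Half(static_cast<float>(row[j]) * scale);
      }
    }
  }
}
```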
ghstack-source-id: 114184398
Test Plan:
```
buck run mode/opt //caffe2/caffe2/python/operator_test:sparse_normalize_test
```
```
buck run mode/opt -c python.package_style=inplace mode/no-gpu //caffe2/caffe2/python/benchmarks:sparse_normalize_benchmark -- --fp16
```
Reviewed By: jspark1105
Differential Revision: D24005618
fbshipit-source-id: 8b918ec4063fdaafa444779b95206ba2b7b38537
Summary:
The `2to3` tool has a `future` fixer that can be targeted specifically to remove these redundant `from __future__` imports; the `caffe2` directory has the most of them:
```2to3 -f future -w caffe2```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033
Reviewed By: seemethere
Differential Revision: D23808648
Pulled By: bugra
fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985
AVX2-optimized 2/4-bit row-wise quantization/dequantization in perfkernels.
This diff slightly changes the numerics of quantization by multiplying by the inverse of the scale instead of dividing by the scale.
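For illustration, a hedged scalar sketch of the numerics change (the fused-row layout and the AVX2 vectorization in perfkernels are omitted; function and variable names are assumptions):
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar sketch of row-wise n-bit quantization. The diff's change is the
// hoisted inverse_scale: one multiply per element replaces a per-element
// divide, at the cost of a slightly different rounding result.
std::vector<uint8_t> QuantizeRowNBit(const float* row, int cols, int bit_rate) {
  const float xmin = *std::min_element(row, row + cols);
  const float xmax = *std::max_element(row, row + cols);
  const int qmax = (1 << bit_rate) - 1;  // e.g. 3 for 2-bit, 15 for 4-bit
  float scale = (xmax - xmin) / qmax;
  if (scale == 0.0f) scale = 1.0f;       // constant row: avoid divide-by-zero
  const float inverse_scale = 1.0f / scale;
  std::vector<uint8_t> q(cols);
  for (int j = 0; j < cols; ++j) {
    // was: std::lrintf((row[j] - xmin) / scale)
    const long v = std::lrintf((row[j] - xmin) * inverse_scale);
    q[j] = static_cast<uint8_t>(std::min<long>(std::max<long>(v, 0), qmax));
  }
  return q;  // elements would then be packed 8 / bit_rate per byte
}
```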
Test Plan:
On my devserver:
```
for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done
```
Before this diff:
```
2-bit
3.35394 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
3.60351 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.434467 ms. 100%. FloatToFused8BitRowwiseQuantized
```
After this diff:
```
2-bit
0.606386 ms. 100%. FloatToFused2BitRowwiseQuantized
4-bit
0.446683 ms. 100%. FloatToFused4BitRowwiseQuantized
8-bit
0.4349 ms. 100%. FloatToFused8BitRowwiseQuantized
```
Reviewed By: choudharydhruv, jianyuh
Differential Revision: D22033195
fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34903
Reattempt of D20461609.
Moving the 2/4-bit SLS (SparseLengthsSum) and row-wise 2/4-bit conversion operators to open source for use by DLRM.
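As a rough illustration of what the moved SLS operator computes, a scalar sketch over the fused n-bit row-wise layout (assuming a bit rate of 2 or 4 and that each row stores its packed payload followed by an FP16 scale and FP16 bias; all names are illustrative):
```
#include <c10/util/Half.h>  // FP16 scale/bias stored per row
#include <cstdint>
#include <cstring>

// Scalar sketch of SparseLengthsSum over fused n-bit rows: for each output
// segment, gather the indexed rows, dequantize on the fly (q * scale + bias),
// and accumulate into FP32 output. bit_rate is assumed to be 2 or 4.
void FusedNBitRowwiseSLS(const uint8_t* data, int64_t row_bytes, int cols,
                         int bit_rate, const int64_t* indices,
                         const int32_t* lengths, int64_t num_segments,
                         float* out) {
  const int per_byte = 8 / bit_rate;      // elements packed per byte
  const uint8_t mask = (1 << bit_rate) - 1;
  int64_t idx = 0;
  for (int64_t s = 0; s < num_segments; ++s, out += cols) {
    std::memset(out, 0, sizeof(float) * cols);
    for (int32_t l = 0; l < lengths[s]; ++l, ++idx) {
      const uint8_t* row = data + indices[idx] * row_bytes;
      // scale and bias are assumed to sit in the last 4 bytes of the row
      const c10::Half* sb =
          reinterpret_cast<const c10::Half*>(row + row_bytes) - 2;
      const float scale = sb[0], bias = sb[1];
      for (int j = 0; j < cols; ++j) {
        const int q = (row[j / per_byte] >> ((j % per_byte) * bit_rate)) & mask;
        out[j] += q * scale + bias;
      }
    }
  }
}
```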
Test Plan: CI
Reviewed By: jianyuh
Differential Revision: D20495304
fbshipit-source-id: 66a99677583f50fd40e29c514710c7b1a8cdbc29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34783
Moving the 2/4-bit SLS and row-wise 2/4-bit conversion operators to open source for use by DLRM.
Test Plan: CI
Reviewed By: yinghai
Differential Revision: D20461609
fbshipit-source-id: b3ef73ff10f2433afe06ffa73fe1145282d9ec4c