pytorch/third_party
Nikhil Gupta 94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4 Bits:
Quantize the Linear layer's weights to 4 bits using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise); for group-wise quantization, the group size must be a multiple of 32. A sketch of this scheme follows.
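
As a hedged illustration of the scheme above, the sketch below quantizes a weight matrix group-wise and packs nibble pairs into uint8. The helper name and the low-nibble-first packing order are assumptions made for this sketch; the layout actually consumed by the packing op below is produced internally.
```python
import torch

def quantize_4bit_groupwise(weight: torch.Tensor, groupsize: int = 32):
    """Illustrative symmetric group-wise 4-bit quantization (sketch only)."""
    out_features, in_features = weight.shape
    assert in_features % groupsize == 0, "in_features must be divisible by groupsize"
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric quantization: scale each group by its max magnitude so values
    # map onto the signed 4-bit range [-8, 7] with an implicit zero-point of 0.
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    q = q.reshape(out_features, in_features)
    # Pack two 4-bit weights into one uint8 container (nibble order assumed).
    q_u = (q + 8).to(torch.uint8)  # shift to unsigned [0, 15]
    packed = q_u[:, 0::2] | (q_u[:, 1::2] << 4)
    return packed, scales.reshape(out_features, -1)
```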

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well (see the snippet below).
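
For example, starting from an existing nn.Linear and reusing the sketch helper above (the variable names and layer sizes are illustrative):
```python
# Extract the tensors to be packed in the next step.
linear = torch.nn.Linear(in_features=64, out_features=32, bias=True)
quantized_weights, scales = quantize_4bit_groupwise(linear.weight.detach(), groupsize=32)
# Keep the bias (if present) in floating point for the packing step.
bias = linear.bias.detach() if linear.bias is not None else None
```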

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to pack the weights, scales, and optional bias into an optimized layout.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
The in_features and out_features arguments must match the Linear layer's corresponding parameters.

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Required inputs are the input tensor, packed_weights, groupsize, in_features, and out_features. A minimal end-to-end example follows.
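
Continuing the snippets above, here is a minimal end-to-end sketch. Whether the op accepts the sketch's packed layout and a None bias is an assumption; the API-usage issue linked below is authoritative.
```python
in_features, out_features, groupsize = 64, 32, 32

# Step 3: pack quantized weights, scales, and optional bias for the kernel.
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
    quantized_weights, scales, bias, groupsize, in_features, out_features
)

# Step 4: activations are quantized dynamically inside the 4-bit matmul.
x = torch.randn(2, in_features)
output = torch.ops.aten._dyn_quant_matmul_4bit(
    x, packed_weights, groupsize, in_features, out_features
)
print(output.shape)  # expected: torch.Size([2, 32])
```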

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode:  40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode:  80 t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
benchmark@0d98dba29d
composable_kernel@50ee4267e2 [AMD] [submodule] aten.bmm CK-backend prototype (#140758) 2024-12-03 06:54:51 +00:00
cpp-httplib@3b6597bba9
cpuinfo@1e83a2fdd3 Update cpuinfo submodule (#138351) 2024-10-19 01:12:29 +00:00
cudnn_frontend@936021bfed [BE]: Update cudnn_frontend submodule to 1.8.0 (#138709) 2024-10-26 01:55:33 +00:00
cutlass@bbe579a9e3 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)" 2024-08-16 18:09:33 +00:00
eigen@3147391d94
fbgemm@dbc3157bf2
flatbuffers@01834de25e
fmt@0c9fce2ffe [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036) 2024-07-29 15:50:00 +00:00
FP16@4dfe081cf6
FXdiv@b408327ac2
gemmlowp
gloo@5354032ea0
googletest@b514bdc898 [EZ][BE] Update googletest submodule (#140988) 2024-11-19 07:49:16 +00:00
ideep@e026f3b031 Upgrade submodule ideep for bf16f32 matmul changes (#143508) 2024-12-19 06:49:16 +00:00
ittapi@5b8a7d7422
kineto@bc1616a65c update kineto to XPU Windows fixed PR. [submodule kineto] (#143445) 2024-12-20 05:57:30 +00:00
kleidiai@202603f38a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124) 2024-12-20 19:32:03 +00:00
mimalloc@b66e3214d8
miniz-3.0.2 [miniz] Make sure miniz extra_size_remaining doesn't go off bound (#141266) 2024-11-21 22:02:28 +00:00
nccl [BE]: Update NCCL submodule to 2.21.5 (#124014) 2024-07-02 14:39:33 +00:00
nlohmann@87cda1d664
NNPACK@c07e3a0400
NVTX@e170594ac7 [Reland2] Update NVTX to NVTX3 (#109843) 2024-08-20 16:33:26 +00:00
onnx@b8baa84466 [Submodule] update submodule onnx==1.17.0 (#139128) 2024-10-31 02:50:00 +00:00
opentelemetry-cpp@a799f4aed9
pocketfft@9d3ab05a7f
protobuf@d1eca4e4b4
psimd@072586a71b
pthreadpool@4fe0e1e183
pybind11@a2e59f0e70 [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087) 2024-09-14 20:48:44 +00:00
python-peachpy@f45429b087
sleef@60e76d2bce
tensorflow_cuda_bazel_build/cuda
tensorpipe@52791a2fd2
valgrind-headers
VulkanMemoryAllocator@a6bfc23725
XNNPACK@4ea82e595b Update XNNPACK Version (#139913) 2024-11-18 18:16:31 +00:00
BUCK.oss [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041) 2024-11-09 00:13:16 +00:00
BUILD
build_bundled.py Fix manual licensing (#128630) 2024-06-14 00:12:09 +00:00
cpp-httplib.BUILD Reapply "distributed debug handlers (#126601)" (#127805) 2024-06-04 19:44:30 +00:00
cuda.BUILD [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489) 2024-08-15 17:11:52 +00:00
cudnn_frontend.BUILD
cudnn.BUILD
cutlass.BUILD [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686) 2024-09-28 21:11:15 +00:00
eigen.BUILD
fmt.BUILD
generate-cpuinfo-wrappers.py
generate-xnnpack-wrappers.py Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724) 2024-09-04 08:45:46 +00:00
glog.buck.bzl
gloo.BUILD
ideep.BUILD
kineto.buck.bzl [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions 2024-07-19 07:19:11 -07:00
kineto.BUILD
LICENSES_BUNDLED.txt Fix manual licensing (#128630) 2024-06-14 00:12:09 +00:00
METADATA.bzl
mkl_headers.BUILD
mkl-dnn.BUILD Add oneDNN BRGEMM support on CPU (#131878) 2024-09-07 13:22:30 +00:00
mkl.BUILD [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051) 2024-05-31 01:20:45 +00:00
nlohmann.BUILD [aoti] Add initial custom op support (#127034) 2024-07-24 20:29:55 +00:00
onnx.BUILD
opentelemetry-cpp.BUILD
README.md
sleef.BUILD
sleef.bzl
substitution.bzl
tensorpipe.BUILD
xnnpack_buck_shim.bzl Update XNNPACK Version (#139913) 2024-11-18 18:16:31 +00:00
xnnpack_src_defs.bzl Update XNNPACK Version (#139913) 2024-11-18 18:16:31 +00:00
xnnpack_wrapper_defs.bzl Update XNNPACK Version (#139913) 2024-11-18 18:16:31 +00:00
xnnpack.buck.bzl [Fast Packing] Add packing ukernels to gemm config (#142191) 2024-12-10 01:06:17 +00:00
xpu.txt Update torch-xpu-ops commit pin (#142113) 2024-12-05 17:00:29 +00:00

This folder contains vendored copies of third-party libraries that we use.