Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00.
Description:

1. Quantize Linear layer weights to 4 bits. Quantize the weights of the Linear layer to 4 bits using symmetric quantization, and pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare quantized weights, scales, and optional bias. After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well.
3. Pack the weights efficiently. Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias:

   ```python
   packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
       weight, scales_and_zeros, bias, groupsize, in_features, out_features
   )
   ```

   Input parameters include in_features and out_features (the same as the Linear layer's corresponding parameters).
4. Perform dynamic quantized matrix multiplication. Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with the quantized weights:

   ```python
   output = torch.ops.aten._dyn_quant_matmul_4bit(
       input, packed_weights, groupsize, in_features, out_features
   )
   ```

   Inputs required are the input tensor, packed_weights, groupsize, in_features, and out_features.

API usage: https://github.com/pytorch/pytorch/issues/143289

Model perf:

| Model | Prefill | Decode |
|---|---|---|
| 7B Transformer | 340 t/s | 40 t/s |
| 2B Transformer | 747 t/s | 80 t/s |

Tests:

```
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
```

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124

Approved by: https://github.com/digantdesai, https://github.com/malfet
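Steps 1 and 2 above can be sketched in plain PyTorch. The helper below is an illustrative assumption, not part of the PyTorch API: it applies group-wise symmetric quantization (scale per group of 64 values, a multiple of 32, values in [-8, 7]) and packs two 4-bit values per uint8. The nibble order chosen here (even column in the high nibble) is arbitrary; the layout the aten ops actually consume is produced internally by _dyn_quant_pack_4bit_weight.

```python
import torch

def quantize_4bit_groupwise(weight: torch.Tensor, groupsize: int = 64):
    # Hypothetical helper: group-wise symmetric 4-bit quantization of a
    # Linear weight of shape (out_features, in_features).
    out_features, in_features = weight.shape
    assert groupsize % 32 == 0 and in_features % groupsize == 0
    # View each output row as groups of `groupsize` input values.
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric scheme: scale = max|w| / 7, quantized values in [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    # Shift to unsigned nibbles [0, 15] and pack two 4-bit values per uint8.
    q = (q + 8).to(torch.uint8).reshape(out_features, in_features)
    packed = (q[:, ::2] << 4) | q[:, 1::2]
    return packed, scales.reshape(out_features, -1)

w = torch.randn(32, 128)
packed, scales = quantize_4bit_groupwise(w, groupsize=64)
print(packed.shape)   # torch.Size([32, 64]), two weights per byte
print(scales.shape)   # torch.Size([32, 2]), one scale per group
```

In the flow described above, tensors like these would then be handed to _dyn_quant_pack_4bit_weight (step 3) and the result to _dyn_quant_matmul_4bit (step 4).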
- benchmark@0d98dba29d
- composable_kernel@50ee4267e2
- cpp-httplib@3b6597bba9
- cpuinfo@1e83a2fdd3
- cudnn_frontend@936021bfed
- cutlass@bbe579a9e3
- eigen@3147391d94
- fbgemm@dbc3157bf2
- flatbuffers@01834de25e
- fmt@0c9fce2ffe
- FP16@4dfe081cf6
- FXdiv@b408327ac2
- gemmlowp
- gloo@5354032ea0
- googletest@b514bdc898
- ideep@e026f3b031
- ittapi@5b8a7d7422
- kineto@bc1616a65c
- kleidiai@202603f38a
- mimalloc@b66e3214d8
- miniz-3.0.2
- nccl
- nlohmann@87cda1d664
- NNPACK@c07e3a0400
- NVTX@e170594ac7
- onnx@b8baa84466
- opentelemetry-cpp@a799f4aed9
- pocketfft@9d3ab05a7f
- protobuf@d1eca4e4b4
- psimd@072586a71b
- pthreadpool@4fe0e1e183
- pybind11@a2e59f0e70
- python-peachpy@f45429b087
- sleef@60e76d2bce
- tensorflow_cuda_bazel_build/cuda
- tensorpipe@52791a2fd2
- valgrind-headers
- VulkanMemoryAllocator@a6bfc23725
- XNNPACK@4ea82e595b
- BUCK.oss
- BUILD
- build_bundled.py
- cpp-httplib.BUILD
- cuda.BUILD
- cudnn_frontend.BUILD
- cudnn.BUILD
- cutlass.BUILD
- eigen.BUILD
- fmt.BUILD
- generate-cpuinfo-wrappers.py
- generate-xnnpack-wrappers.py
- glog.buck.bzl
- gloo.BUILD
- ideep.BUILD
- kineto.buck.bzl
- kineto.BUILD
- LICENSES_BUNDLED.txt
- METADATA.bzl
- mkl_headers.BUILD
- mkl-dnn.BUILD
- mkl.BUILD
- nlohmann.BUILD
- onnx.BUILD
- opentelemetry-cpp.BUILD
- README.md
- sleef.BUILD
- sleef.bzl
- substitution.bzl
- tensorpipe.BUILD
- xnnpack_buck_shim.bzl
- xnnpack_src_defs.bzl
- xnnpack_wrapper_defs.bzl
- xnnpack.buck.bzl
- xpu.txt
This folder contains vendored copies of third-party libraries that we use.