Mirror of https://github.com/zebrajr/pytorch.git, synced 2025-12-07 12:21:27 +01:00.
Description:

1. Quantize Linear layer weights to 4 bits. Quantize the weights of the Linear layer to 4 bits using symmetric quantization, and pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare quantized weights, scales, and optional bias. After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well.
3. Pack the weights efficiently. Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias:

   ```python
   packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
       weight, scales_and_zeros, bias, groupsize, in_features, out_features
   )
   ```

   Input parameters include in_features and out_features (the same as the Linear layer's corresponding parameters).
4. Perform dynamic quantized matrix multiplication. Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with the quantized weights:

   ```python
   output = torch.ops.aten._dyn_quant_matmul_4bit(
       input, packed_weights, groupsize, in_features, out_features
   )
   ```

   Inputs required are the input tensor, packed_weights, groupsize, in_features, and out_features.

API usage: https://github.com/pytorch/pytorch/issues/143289

Model perf:

| Model | Prefill | Decode |
|---|---|---|
| 7B Transformer | 340 t/s | 40 t/s |
| 2B Transformer | 747 t/s | 80 t/s |

Tests:

```
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
```

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124

Approved by: https://github.com/digantdesai, https://github.com/malfet
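Steps 1 and 2 above can be sketched in plain PyTorch. The helper below is an illustrative assumption, not part of the PyTorch API: it applies group-wise symmetric quantization (scale per group of 64 values, a multiple of 32, values in [-8, 7]) and packs two 4-bit values per uint8. The nibble order chosen here (even column in the high nibble) is arbitrary; the layout the aten ops actually consume is produced internally by _dyn_quant_pack_4bit_weight.

```python
import torch

def quantize_4bit_groupwise(weight: torch.Tensor, groupsize: int = 64):
    # Hypothetical helper: group-wise symmetric 4-bit quantization of a
    # Linear weight of shape (out_features, in_features).
    out_features, in_features = weight.shape
    assert groupsize % 32 == 0 and in_features % groupsize == 0
    # View each output row as groups of `groupsize` input values.
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric scheme: scale = max|w| / 7, quantized values in [-8, 7].
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    # Shift to unsigned nibbles [0, 15] and pack two 4-bit values per uint8.
    q = (q + 8).to(torch.uint8).reshape(out_features, in_features)
    packed = (q[:, ::2] << 4) | q[:, 1::2]
    return packed, scales.reshape(out_features, -1)

w = torch.randn(32, 128)
packed, scales = quantize_4bit_groupwise(w, groupsize=64)
print(packed.shape)   # torch.Size([32, 64]), two weights per byte
print(scales.shape)   # torch.Size([32, 2]), one scale per group
```

In the flow described above, tensors like these would then be handed to _dyn_quant_pack_4bit_weight (step 3) and the result to _dyn_quant_matmul_4bit (step 4).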
- benchmark@0d98dba29d
- composable_kernel@50ee4267e2
- cpp-httplib@3b6597bba9
- cpuinfo@1e83a2fdd3
- cudnn_frontend@936021bfed
- cutlass@bbe579a9e3
- eigen@3147391d94
- fbgemm@dbc3157bf2
- flatbuffers@01834de25e
- fmt@0c9fce2ffe
- FP16@4dfe081cf6
- FXdiv@b408327ac2
- gemmlowp
- gloo@5354032ea0
- googletest@b514bdc898
- ideep@e026f3b031
- ittapi@5b8a7d7422
- kineto@bc1616a65c
- kleidiai@202603f38a
- mimalloc@b66e3214d8
- miniz-3.0.2
- nccl
- nlohmann@87cda1d664
- NNPACK@c07e3a0400
- NVTX@e170594ac7
- onnx@b8baa84466
- opentelemetry-cpp@a799f4aed9
- pocketfft@9d3ab05a7f
- protobuf@d1eca4e4b4
- psimd@072586a71b
- pthreadpool@4fe0e1e183
- pybind11@a2e59f0e70
- python-peachpy@f45429b087
- sleef@60e76d2bce
- tensorflow_cuda_bazel_build/cuda
- tensorpipe@52791a2fd2
- valgrind-headers
- VulkanMemoryAllocator@a6bfc23725
- XNNPACK@4ea82e595b
- BUCK.oss
- BUILD
- build_bundled.py
- cpp-httplib.BUILD
- cuda.BUILD
- cudnn_frontend.BUILD
- cudnn.BUILD
- cutlass.BUILD
- eigen.BUILD
- fmt.BUILD
- generate-cpuinfo-wrappers.py
- generate-xnnpack-wrappers.py
- glog.buck.bzl
- gloo.BUILD
- ideep.BUILD
- kineto.buck.bzl
- kineto.BUILD
- LICENSES_BUNDLED.txt
- METADATA.bzl
- mkl_headers.BUILD
- mkl-dnn.BUILD
- mkl.BUILD
- nlohmann.BUILD
- onnx.BUILD
- opentelemetry-cpp.BUILD
- README.md
- sleef.BUILD
- sleef.bzl
- substitution.bzl
- tensorpipe.BUILD
- xnnpack_buck_shim.bzl
- xnnpack_src_defs.bzl
- xnnpack_wrapper_defs.bzl
- xnnpack.buck.bzl
- xpu.txt
This folder contains vendored copies of third-party libraries that we use.