PyPI Stats

Search Packages

ctranslate2 (OpenNMT)
  Fast inference engine for Transformer models
  8.3M downloads · 4K stars · 478 forks

faster-whisper (SYSTRAN)
  Faster Whisper transcription with CTranslate2
  7.4M downloads · 23K stars · 2K forks

bitsandbytes (bitsandbytes-foundation)
  Accessible large language models via k-bit quantization for PyTorch
  6.5M downloads · 8K stars · 847 forks

torchao (pytorch)
  PyTorch-native quantization and sparsity for training and inference
  3.4M downloads · 3K stars · 502 forks

optimum (huggingface)
  🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM, and Sentence Transformers with easy-to-use hardware optimization tools
  1.7M downloads · 3K stars · 639 forks

onnx2tf (PINTO0309)
  A tool for converting ONNX files to LiteRT/TFLite/TensorFlow, PyTorch native code (nn.Module), TorchScript (.pt), state_dict (.pt), Exported Program (.pt2), and Dynamo ONNX. Also supports direct conversion from LiteRT to PyTorch
  1.5M downloads · 953 stars · 99 forks

nncf (openvinotoolkit)
  Neural Network Compression Framework for enhanced OpenVINO™ inference
  456K downloads · 1K stars · 293 forks

llmcompressor (vllm-project)
  Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
  285K downloads · 3K stars · 498 forks

optimum-quanto (huggingface)
  A PyTorch quantization backend for Optimum
  267K downloads · 1K stars · 86 forks

sageattention (thu-ml)
  [ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized attention achieving a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models
  156K downloads · 3K stars · 403 forks

tensorflow-model-optimization (tensorflow)
  A toolkit to optimize ML models for deployment with Keras and TensorFlow, including quantization and pruning
  105K downloads · 2K stars · 347 forks

auto-gptq (PanQiWei)
  An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm
  75K downloads · 5K stars · 540 forks

auto-round (intel)
  A SOTA quantization algorithm for high-accuracy, low-bit LLM inference, optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers
  71K downloads · 1K stars · 125 forks

navec (natasha)
  Compact, high-quality word embeddings for the Russian language
  51K downloads · 218 stars · 19 forks

gptqmodel (ModelCloud)
  LLM quantization (compression) toolkit with hardware acceleration for NVIDIA, AMD, and Intel GPUs and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang
  38K downloads · 1K stars · 185 forks

quantcpp (quantumaikr)
  LLM inference with 7x longer context: lossless KV-cache compression in a single-header library, written in pure C with zero dependencies
  38K downloads · 386 stars · 42 forks

intel-extension-for-pytorch (intel)
  A Python package that extends official PyTorch for improved performance on Intel platforms
  36K downloads · 2K stars · 315 forks

tqdb (jyunming)
  Embedded vector database in Rust with Python bindings: TurboQuant algorithm (arXiv:2504.19874), zero training, 2-4 bit compression, HNSW ANN search, WAL persistence
  31K downloads · 2 stars · 0 forks

aimet-torch (quic)
  AIMET is a library that provides advanced quantization and compression techniques for trained neural network models
  29K downloads · 3K stars · 450 forks

llamafactory (hiyouga)
  Unified efficient fine-tuning of 100+ LLMs and VLMs (ACL 2024)
  29K downloads · 71K stars · 9K forks

neural-compressor (intel)
  SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) and sparsity; leading model compression techniques for PyTorch, TensorFlow, and ONNX Runtime
  22K downloads · 3K stars · 304 forks

qonnx (fastmachinelearning)
  QONNX: arbitrary-precision quantized neural networks in ONNX
  21K downloads · 184 stars · 57 forks

brevitas (Xilinx)
  Brevitas: neural network quantization in PyTorch
  21K downloads · 2K stars · 243 forks

aimet-onnx (quic)
  AIMET is a library that provides advanced quantization and compression techniques for trained neural network models
  19K downloads · 3K stars · 450 forks
Data from PyPI, GitHub, ClickHouse, and BigQuery