High-performance CPU KV-cache quantization engine for LLM inference (~10× speedup, 4× memory reduction) with Python & PyTorch support.
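The engine's internals aren't shown here, but the stated 4× memory reduction is consistent with storing keys/values in 8-bit integers instead of 32-bit floats. As a minimal, hypothetical sketch of that idea (symmetric per-channel int8 quantization, in NumPy for self-containedness; function names are illustrative, not this project's API):

```python
import numpy as np

def quantize_kv(kv: np.ndarray):
    """Symmetric per-channel int8 quantization of a KV-cache tensor.

    kv: float32 array of shape (seq_len, num_heads, head_dim).
    Returns (int8 codes, per-channel float32 scales).
    """
    # Scale each channel by its max magnitude so values map into [-127, 127].
    scales = np.abs(kv).max(axis=0, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_kv(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Reconstruct approximate float32 values from codes and scales.
    return q.astype(np.float32) * scales

kv = np.random.randn(128, 8, 64).astype(np.float32)
q, scales = quantize_kv(kv)

# int8 storage is 4x smaller than float32; the per-channel scales
# add only negligible overhead.
print(kv.nbytes / q.nbytes)  # 4.0
```

Dequantization introduces a small per-element error bounded by half the channel scale, which is the usual accuracy/memory trade-off such engines tune.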