PyPI Stats

vLLM Python Packages

Python packages with the GitHub topic vllm, sorted by relevance; monthly downloads and GitHub stars are shown for each entry.
lightseekorg / smg-grpc-proto

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

868K 206 62
lightseekorg / smg-grpc-servicer

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

593K 206 62
kvcache-ai / mooncake-transfer-engine

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

348K 5K 720
kserve / kserve

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

116K 5K 1K
LMCache / lmcache

Supercharge Your LLM with the Fastest KV Cache Layer

112K 8K 1K
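LMCache's pitch above is reusing KV cache across requests. As a toy, engine-agnostic sketch (hypothetical, not LMCache's actual API), reuse keyed by the longest matching token prefix might look like:

```python
# Toy illustration of prefix-keyed KV-cache reuse (hypothetical sketch,
# not LMCache's API). A real engine stores per-layer attention tensors;
# here the "KV state" is an opaque stub value.

class PrefixKVCache:
    def __init__(self):
        self._store = {}  # token-prefix tuple -> opaque KV state

    def put(self, tokens, kv_state):
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_state) for the longest cached prefix."""
        for n in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:n]))
            if state is not None:
                return n, state
        return 0, None

cache = PrefixKVCache()
cache.put([1, 2, 3], "kv-for-123")  # e.g. a shared system prompt
hit, state = cache.longest_prefix([1, 2, 3, 4, 5])
# Only tokens[hit:] need fresh prefill; the first `hit` tokens reuse `state`.
```

The point of the sketch is only the lookup discipline: a request pays prefill cost solely for the suffix past the longest cached prefix.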
intel / auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

73K 1K 125
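The quantization packages listed here compress weights to low bit-widths. As a generic illustration of plain round-to-nearest asymmetric quantization (deliberately not auto-round's algorithm, which additionally tunes the rounding), 4-bit quantization of a weight list might look like:

```python
# Generic round-to-nearest asymmetric quantization to `bits` bits
# (illustrative sketch; names and layout are hypothetical).

def quantize_rtn(weights, bits=4):
    qmax = (1 << bits) - 1                    # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0           # avoid div-by-zero for constant weights
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

w = [-0.8, -0.1, 0.0, 0.35, 1.2]
q, s, z = quantize_rtn(w)
w_hat = dequantize(q, s, z)
# Round-to-nearest keeps reconstruction error within half a step (scale / 2).
```

Learned-rounding methods such as the one these packages advertise aim to beat that per-weight error bound on end-to-end model accuracy, not just per-weight error.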
stackav-oss / conch-triton-kernels

A "standard library" of Triton kernels.

45K 24 3
xorbitsai / xinference

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

44K 9K 824
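Several servers and gateways in this list (xinference, smg, vLLM itself) expose OpenAI-compatible HTTP endpoints, which is what makes the "change a single line of code" swap work. A minimal stdlib sketch of building a chat-completion request against such an endpoint — the URL and model name below are placeholders, not defaults of any specific package:

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages):
    """Build an OpenAI-style /v1/chat/completions request (nothing is sent here)."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:9997",             # placeholder address
    "my-local-model",                    # placeholder model name
    [{"role": "user", "content": "Hello"}],
)
# urllib.request.urlopen(req) would send it once a compatible server is running.
```

Because only `base_url` and `model` vary, the same client code targets any of these backends.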
ModelCloud / gptqmodel

LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.

39K 1K 185
MekayelAnik / vllm-cpu

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

29K 6 0
intel / auto-round-nightly

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

18K 1K 125
containers / ramalama

RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.

11K 3K 337
intel / auto-round-lib

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

9K 1K 125
project-david-ai / projectdavid-platform

A single pip-installed package that orchestrates a production-ready instance of the AI stack in any environment

9K 1 0
kvcache-ai / mooncake-transfer-engine-cuda13

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

8K 5K 720
vllm-project / vllm-ascend

Community-maintained hardware plugin for vLLM on Ascend

7K 2K 1K
Alberto-Codes / turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

7K 46 5
ModelEngine-Group / uc-manager

Persist and reuse the KV cache to speed up your LLM.

7K 274 73
katanaml / sparrow-parse

Structured data extraction and instruction calling with ML, LLM and Vision LLM

6K 5K 515
call518 / logsentinelai

LLM-powered security log analyzer: detect threats & anomalies with zero regex — just declare a Pydantic schema. Real-time Telegram alerts, SIEM-ready with Elasticsearch/Kibana. Supports OpenAI, Ollama, vLLM.

5K 46 9
Embedl / flash-head

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

4K 6 1
MekayelAnik / vllm-cpu-avx512bf16

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

3K 6 0
theoddden / terradev-cli

Cross-Cloud Compute Optimization Platform with Migration & Evaluation - v4.0.12

3K 10 1
lightseekorg / smg

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

3K 206 62
    • Data from PyPI, GitHub, ClickHouse, and BigQuery