PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter them by metrics. Each result shows the maintainer, the package, its description, and its downloads, GitHub stars, and forks.
lightseekorg / smg-grpc-proto
Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.
829K downloads · 206 stars · 62 forks

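The blurb above claims Anthropic (and OpenAI) API compatibility. As a minimal sketch of what that implies, the stock anthropic Python client can be repointed at a self-hosted gateway; the base URL, port, API key, and model name below are assumptions for illustration, not documented smg defaults.

```python
# Sketch: pointing the stock Anthropic SDK at an Anthropic-compatible gateway.
# The URL, key, and model name are placeholders, not documented smg defaults.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8080",  # assumed gateway address
    api_key="tenant-key",              # placeholder for the gateway's multi-tenant auth
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # whichever backend model the gateway routes to
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(message.content[0].text)
```
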
lightseekorg / smg-grpc-servicer
Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.
560K downloads · 206 stars · 62 forks

kvcache-ai / mooncake-transfer-engine
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
338K downloads · 5K stars · 720 forks

LMCache / lmcache
Supercharge Your LLM with the Fastest KV Cache Layer
120K downloads · 8K stars · 1K forks

kserve / kserve
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
114K downloads · 5K stars · 1K forks

intel / auto-round
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.
71K downloads · 1K stars · 125 forks

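As a rough sketch of how a quantizer like this is typically driven, the entry points below follow auto-round's README at the time of writing; treat the exact signatures as assumptions subject to version drift.

```python
# Hedged sketch of auto-round usage; argument names may differ between versions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # small model, purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight quantization with group size 128, per the README's example.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize_and_save("./opt-125m-int4", format="auto_round")
```
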
stackav-oss / conch-triton-kernels
A "standard library" of Triton kernels.
44K downloads · 24 stars · 3 forks

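conch's own API isn't shown here, but for readers unfamiliar with what a "Triton kernel" is, a minimal standalone example in standard Triton (not conch code) looks like this:

```python
# A minimal standard Triton kernel (illustrative; not part of conch's API).
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```
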
xorbitsai / xinference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
43K downloads · 9K stars · 824 forks

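The "single line of code" claim refers to repointing an OpenAI-style client at a local server. A sketch, assuming Xinference's default port of 9997 and a model that has already been launched:

```python
# The one-line swap: same OpenAI client, different base_url.
# Port 9997 and the model name are assumptions for illustration.
from openai import OpenAI

# client = OpenAI()  # before: talks to api.openai.com
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5-instruct",  # a model previously launched in Xinference
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
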
ModelCloud / gptqmodel
LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
38K downloads · 1K stars · 185 forks

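A hedged sketch of the load-quantize-save flow this kind of toolkit exposes; the GPTQModel.load / quantize / save entry points follow the project README, but the exact signatures are an assumption here:

```python
# Hedged sketch; method names follow gptqmodel's README, signatures may drift.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("facebook/opt-125m", quant_config)

# A real run needs a few hundred calibration texts; two strings only
# keep this sketch self-contained.
calibration = ["gptqmodel is a quantization toolkit.", "The sky is blue."]
model.quantize(calibration)
model.save("./opt-125m-gptq-4bit")
```
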
MekayelAnik / vllm-cpu
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
31K downloads · 6 stars · 0 forks

intel / auto-round-nightly
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.
18K downloads · 1K stars · 125 forks

containers / ramalama
RamaLama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of containers.
11K downloads · 3K stars · 337 forks

project-david-ai / projectdavid-platform
A single pip-installed package that orchestrates a production-ready instance of the AI stack in any environment
11K downloads · 1 star · 0 forks

intel / auto-round-lib
A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.
9K downloads · 1K stars · 125 forks

Alberto-Codes / turboquant-vllm
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
8K downloads · 46 stars · 5 forks

ModelEngine-Group / uc-manager
Persist and reuse KV Cache to speed up your LLM.
7K downloads · 274 stars · 73 forks

vllm-project / vllm-ascend
Community-maintained hardware plugin for vLLM on Ascend
7K downloads · 2K stars · 1K forks

kvcache-ai / mooncake-transfer-engine-cuda13
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
7K downloads · 5K stars · 720 forks

katanaml / sparrow-parse
Structured data extraction and instruction calling with ML, LLM and Vision LLM
6K downloads · 5K stars · 515 forks

call518 / logsentinelai
LLM-powered security log analyzer: detect threats & anomalies with zero regex — just declare a Pydantic schema. Real-time Telegram alerts, SIEM-ready with Elasticsearch/Kibana. Supports OpenAI, Ollama, vLLM.
5K downloads · 46 stars · 9 forks

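The "zero regex, just declare a Pydantic schema" approach means the analyzer asks the LLM to emit structured objects instead of pattern-matching log lines. An illustrative schema of that kind follows; the field names are hypothetical, not logsentinelai's actual API.

```python
# Hypothetical schema of the kind a declare-a-Pydantic-model analyzer consumes.
# Field names are illustrative, not logsentinelai's real API.
from enum import Enum
from pydantic import BaseModel, Field


class Severity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class LogEvent(BaseModel):
    """Structure the LLM is asked to extract from each raw log line."""

    source_ip: str = Field(description="IP address that produced the event")
    action: str = Field(description="What happened, e.g. 'failed SSH login'")
    is_anomalous: bool = Field(description="Whether the event looks like a threat")
    severity: Severity
```
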
Embedl / flash-head
FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
4K downloads · 6 stars · 1 fork

theoddden / terradev-cli
Cross-Cloud Compute Optimization Platform with Migration & Evaluation - v4.0.12
3K downloads · 10 stars · 1 fork

lightseekorg / smg
Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.
3K downloads · 206 stars · 62 forks

MekayelAnik / vllm-cpu-avx512bf16
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
3K downloads · 6 stars · 0 forks

    • Data from PyPI, GitHub, ClickHouse, and BigQuery