SGLang Python Packages

smg-grpc-proto

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

868K 206 62

smg-grpc-servicer

593K 206 62

mooncake-transfer-engine

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

348K 5K 720

auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

73K 1K 125

gptqmodel

LLM quantization (compression) toolkit with hardware acceleration support for Nvidia, AMD, and Intel GPUs and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.

39K 1K 185

auto-round-nightly

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

18K 1K 125

auto-round-lib

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

9K 1K 125

mooncake-transfer-engine-cuda13

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

8K 5K 720

dgxarley

Ansible playbooks for a 3-node K3s cluster with NVIDIA DGX Spark nodes for distributed LLM inference

5K 1 0

terradev-cli

Cross-Cloud Compute Optimization Platform with Migration & Evaluation - v4.0.12

3K 10 1

smg

3K 206 62

strands-sglang

SGLang model provider for Strands Agents for on-policy agentic RL training.

3K 52 8

mooncake-transfer-engine-non-cuda

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

3K 5K 720

infergrid

Tenant-fair LLM inference orchestration on a single GPU. No Kubernetes.

2K 1 1

auto-round-hpu

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

2K 1K 125

agentic-coding-bench

Open-source benchmark for LLM inference on agentic coding workloads

2K 0 0

llm-cal

LLM inference hardware calculator: architecture-aware (MLA/NSA/MoE), engine-aware (vLLM/SGLang), honestly labeled. Reads real safetensors bytes; supports 53 GPUs (NVIDIA / AMD / Huawei Ascend / MetaX / Kunlunxin / Biren / Cambricon / Hygon).

2K 1 0

kvwarden

Tenant-fair LLM inference orchestration on a single GPU. No Kubernetes.

1K 2 1

kvcached

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

1K 902 107

auto-round-kernel

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

714 1K 125

anchor-vision

Python client for Anchor — PaliGemma2 multi-LoRA vision inference

631 0 0

docvision

Production-ready document parsing with Vision Language Models

575 1 0

flashtts

Provides high-quality Chinese speech synthesis and voice cloning services based on models such as SparkTTS and OrpheusTTS.

294 601 76

terradev-mcp

Complete Agentic GPU Infrastructure for Claude Code

120 10 2