SGLang Python Packages

smg-grpc-proto

Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.

829K 206 62

smg-grpc-servicer

560K 206 62

mooncake-transfer-engine

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

338K 5K 720

auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

71K 1K 125

gptqmodel

LLM model quantization (compression) toolkit with HW acceleration support for Nvidia, AMD, Intel GPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.

38K 1K 185

auto-round-nightly

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

18K 1K 125

auto-round-lib

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

9K 1K 125

mooncake-transfer-engine-cuda13

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

7K 5K 720

dgxarley

Ansible playbooks for a 3-node K3s cluster with NVIDIA DGX Spark nodes for distributed LLM inference

7K 1 0

terradev-cli

Cross-Cloud Compute Optimization Platform with Migration & Evaluation - v4.0.12

3K 10 1

strands-sglang

SGLang model provider for Strands Agents for on-policy agentic RL training.

3K 52 8

smg

3K 206 62

mooncake-transfer-engine-non-cuda

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

3K 5K 720

infergrid

Tenant-fair LLM inference orchestration on a single GPU. No Kubernetes.

2K 1 1

auto-round-hpu

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

2K 1K 125

agentic-coding-bench

Open-source benchmark for LLM inference on agentic coding workloads

2K 0 0

llm-cal

LLM inference hardware calculator: architecture-aware (MLA/NSA/MoE), engine-aware (vLLM/SGLang), honest-labeled. Reads real safetensors bytes; supports 53 GPUs (NVIDIA / AMD / Huawei Ascend / MetaX (沐曦) / Kunlunxin (昆仑芯) / Biren (壁仞) / Cambricon (寒武纪) / Hygon (海光)).

1K 1 0

kvcached

Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond

1K 902 107

kvwarden

Tenant-fair LLM inference orchestration on a single GPU. No Kubernetes.

996 2 1

auto-round-kernel

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

744 1K 125