PyPI Stats

MoE Python Packages

Python packages tagged with the GitHub topic moe (Mixture of Experts), sorted by relevance. Each entry lists monthly downloads, GitHub stars, and forks.
• sgl-project/sglang (306.7M downloads, 27K stars, 6K forks)
  SGLang is a high-performance serving framework for large language models and multimodal models.

• vllm-project/vllm (8.9M downloads, 79K stars, 16K forks)
  A high-throughput and memory-efficient inference and serving engine for LLMs.

• flashinfer-ai/flashinfer-python (4.1M downloads, 6K stars, 948 forks)
  FlashInfer: Kernel Library for LLM Serving.

• flashinfer-ai/flashinfer-cubin (2.7M downloads, 6K stars, 948 forks)
  FlashInfer: Kernel Library for LLM Serving.

• sgl-project/sglang-kernel (273K downloads, 27K stars, 6K forks)
  SGLang is a high-performance serving framework for large language models and multimodal models.

• sgl-project/sgl-kernel (254K downloads, 27K stars, 6K forks)
  SGLang is a high-performance serving framework for large language models and multimodal models.

• modelscope/ms-swift (176K downloads, 14K stars, 1K forks)
  Use PEFT or full-parameter training to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, Phi4, ...) (AAAI 2025).

• vllm-project/vllm-tpu (145K downloads, 79K stars, 16K forks)
  A high-throughput and memory-efficient inference and serving engine for LLMs.

• hiyouga/llamafactory (28K downloads, 71K stars, 9K forks)
  Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024).

• NVIDIA/tensorrt-llm (16K downloads, 14K stars, 2K forks)
  TensorRT LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also contains components for building Python and C++ runtimes that orchestrate inference execution performantly.

• sgl-project/sglang-kt (4K downloads, 27K stars, 6K forks)
  SGLang is a high-performance serving framework for large language models and multimodal models.

• inclusionAI/awex (4K downloads, 150 stars, 17 forks)
  A high-performance RL training-inference weight synchronization framework, designed to enable second-level parameter updates from training to inference in RL workflows.

• theoddden/terradev-cli (3K downloads, 10 stars, 1 fork)
  Cross-Cloud Compute Optimization Platform with Migration & Evaluation, v4.0.12.

• wuwangzhang1216/abliterix (2K downloads, 215 stars, 42 forks)
  Automated alignment adjustment for LLMs — direct steering, LoRA, and MoE expert-granular abliteration, optimized via multi-objective Optuna TPE.

• hiyouga/llmtuner (2K downloads, 71K stars, 9K forks)
  Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024).

• uccl-project/uccl (1K downloads, 1K stars, 144 forks)
  UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven).

• szibis/mlx-flash (857 downloads, 2 stars, 0 forks)
  Run AI models too large for your Mac's memory — expert caching, speculative execution, and 15+ research techniques for MoE inference on Apple Silicon.

• kyegomez/switch-transformers (836 downloads, 139 stars, 17 forks)
  Implementation of Switch Transformers from the paper "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity".

• SuperInstance/plato-edge (742 downloads, 2 stars, 0 forks)
  Edge-optimized Cocapn fleet packages for ARM64 — pure Python, zero deps, <100KB.

• sgl-project/dblcsgen (613 downloads, 27K stars, 6K forks)
  SGLang is a high-performance serving framework for large language models and multimodal models.

• hiyouga/lazyllm-llamafactory (501 downloads, 71K stars, 9K forks)
  Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024).

• vllm-project/vllm-xft (485 downloads, 79K stars, 16K forks)
  A high-throughput and memory-efficient inference and serving engine for LLMs.

• vllm-project/vllm-acc (484 downloads, 79K stars, 16K forks)
  A high-throughput and memory-efficient inference and serving engine for LLMs.

• vllm-project/vllm-hust (480 downloads, 79K stars, 16K forks)
  A high-throughput and memory-efficient inference and serving engine for LLMs.
    • Data from PyPI, GitHub, ClickHouse, and BigQuery