PyPI Stats

Triton Python Packages

Python packages with the GitHub topic triton, sorted by relevance and listed with monthly downloads and GitHub stars.
linkedin
liger-kernel

Efficient Triton Kernels for LLM Training

790K 6K 526
thu-ml
sageattention

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

155K 3K 403
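SageAttention's speedup comes from quantizing the attention inputs to low precision (e.g. INT8) before the matrix multiplies, then dequantizing with a stored scale. A minimal sketch of symmetric per-tensor INT8 quantization in plain Python — illustrative helper names, not SageAttention's actual API:

```python
def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(v / scale))) for v in x]
    return q, scale

def int8_dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [v * scale for v in q]

x = [0.12, -1.5, 0.0, 0.98]
q, s = int8_quantize(x)
x_hat = int8_dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(x, x_hat))
```

The package's contribution is doing this per-block inside fused Triton kernels so the quantized matmuls run on INT8 tensor cores without an accuracy loss end to end.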
linkedin
liger-kernel-nightly

Efficient Triton Kernels for LLM Training

63K 6K 526
meta-pytorch
tritonparse

TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels

17K 204 26
clearml
clearml-serving

ClearML - Model-Serving Orchestration and Repository Solution

11K 164 50
Alberto-Codes
turboquant-vllm

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

7K 46 5
notAI-tech
fastdeploy

Deploy DL/ML inference pipelines with minimal extra code.

4K 103 17
remcofl
hilbertsfc

Ultra-fast 2D & 3D Hilbert curve kernels in Python. JIT compiled, branchless, L1-cache-friendly lookup tables, loop unrolling, SIMD, and multi-threading.

2K 6 1
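hilbertsfc precomputes lookup tables for this mapping; the underlying 2-D Hilbert curve index-to-coordinate conversion can be sketched with the classic iterative rotate-and-flip algorithm (a plain-Python illustration, not the package's JIT-compiled kernels):

```python
def hilbert_d2xy(order, d):
    """Map distance d along a 2-D Hilbert curve over a 2**order x 2**order
    grid to (x, y) coordinates, via the classic rotate-and-flip iteration."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:  # flip the quadrant
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate so sub-curves join end to end
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Consecutive indices land on adjacent cells (Manhattan distance 1),
# which is what makes the curve cache-friendly for spatial data.
pts = [hilbert_d2xy(3, d) for d in range(64)]
assert len(set(pts)) == 64  # bijection over the 8x8 grid
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in zip(pts, pts[1:]))
```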
chansigit
torchgw

Fast Sampled Gromov-Wasserstein optimal transport solver — pure PyTorch, scalable, differentiable

1K 1 0
DeepLink-org
dlblas

DLBlas: clean and efficient kernels

961 39 12
toyaix
triton-runner

Multi-Level Triton Runner supporting Python, IR, PTX, AMDGCN, cubin, and hsaco.

866 95 5
flagos-ai
flag-gems

FlagGems is an operator library for large language models implemented in the Triton Language.

785 982 349
notAI-tech
fdclient

fastDeploy python client

753 103 17
dame-cell
triformer

Transformer components, but in Triton

571 34 1
HKUSTDial
flash-sparse-attn

Trainable, fast, and memory-efficient sparse attention

534 637 60
DeepAuto-AI
hip-attn

Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.

432 151 15
msclock
transformersplus

Adds some extra features to transformers

407 0 0
ot-triton-lab
flash-sinkhorn

Sinkhorn optimal transport kernels in PyTorch + Triton (squared Euclidean, no cost matrix materialization).

365 189 19
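The iteration flash-sinkhorn accelerates is the standard entropy-regularized Sinkhorn loop: build a kernel matrix from the costs, then alternately rescale rows and columns until both marginals match. A toy dense version in plain Python for intuition — the package's point is fusing this into Triton kernels so the cost matrix is never materialized:

```python
import math

def sinkhorn(cost, a, b, eps=1.0, iters=1000):
    """Entropy-regularized optimal transport: alternately rescale rows and
    columns of K = exp(-cost/eps) until the plan has marginals a and b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan: P[i][j] = u[i] * K[i][j] * v[j]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Uniform marginals over 3 points, squared-Euclidean costs.
xs, ys = [0.0, 1.0, 2.0], [0.5, 1.5, 2.5]
C = [[(xi - yj) ** 2 for yj in ys] for xi in xs]
P = sinkhorn(C, [1 / 3] * 3, [1 / 3] * 3)
assert all(abs(sum(row) - 1 / 3) < 1e-6 for row in P)  # row marginals converge
```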
followthesapper
atlas-quantum

GPU-accelerated quantum tensor network simulator with adaptive MPS

357 0 0
rayliuca
grammared-language

Adding Grammarly (and other) open source ML models to LanguageTool

346 6 0
toyaix
tritonllm

Flexible, modular LLM inference via Triton, focused on kernel optimization using CUBIN binaries, starting from the gpt-oss model

281 114 5
Argonaut790
fused-turboquant

Fused Triton encode/decode kernels for TurboQuant KV cache compression, powered by Randomized Hadamard Transform.

226 8 0
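A randomized Hadamard transform rotates vectors before quantization to spread outliers across coordinates; its deterministic core is the fast Walsh-Hadamard transform, which runs in O(n log n) with only additions and subtractions. A plain-Python sketch of that core (illustrative only, not the package's fused Triton kernels, which also apply random sign flips):

```python
def fwht(x):
    """Fast Walsh-Hadamard transform (input length must be a power of 2).
    The Hadamard matrix is orthogonal up to scaling, so applying the
    transform twice and dividing by len(x) recovers the input."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return x

v = [1.0, -2.0, 0.5, 3.0]
w = fwht(v)
roundtrip = [c / len(v) for c in fwht(w)]
assert all(abs(a - b) < 1e-12 for a, b in zip(v, roundtrip))
```

Because the transform is self-inverse up to a 1/n factor, the decode kernel can undo the rotation exactly after dequantizing the KV cache.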
rodrigobaron
quick-deploy

Optimize, convert and deploy machine learning models as fast inference API using Triton and ORT. Currently supports Hugging Face transformers, PyTorch, TensorFlow, scikit-learn and XGBoost models.

221 6 1
Sharveswar007
ssblast

First open-source FP8 linear solver for consumer NVIDIA GPUs — 2-3x faster than cuBLAS FP64. pip install ssblast

185 0 0
    • Data from PyPI, GitHub, ClickHouse, and BigQuery