PyPI Stats

LLM Inference Python Packages

Python packages with the GitHub topic llm-inference, sorted by relevance. Each entry shows monthly downloads, GitHub stars, and forks.
ray-project
ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

52.9M 42K 8K
flashinfer-ai
flashinfer-python

FlashInfer: Kernel Library for LLM Serving

4.1M 6K 948
flashinfer-ai
flashinfer-cubin

FlashInfer: Kernel Library for LLM Serving

2.7M 6K 948
openvinotoolkit
openvino

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

1.4M 10K 3K
bentoml
bentoml

The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more.

197K 9K 959
openvinotoolkit
openvino-dev

OpenVINO™ is an open source toolkit for optimizing and deploying AI inference

170K 10K 3K
kserve
kserve

Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes

115K 5K 1K
nomic-ai
gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.

77K 77K 8K
ray-project
ant-ray-cpp-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

48K 42K 8K
ray-project
ray-cpp

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

40K 42K 8K
quantumaikr
quantcpp

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

39K 386 42
feifeibear
yunchang

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference

38K 666 79
MekayelAnik
vllm-cpu

Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets

30K 6 0
bentoml
openllm

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.

17K 12K 807
monocle2ai
monocle-apptrace

Monocle is a framework for tracing GenAI app code. This repo contains the implementation of Monocle for GenAI apps written in Python.

17K 94 32
character-ai
prompt-poet

Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.

15K 1K 95
lightning-AI
litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

14K 13K 1K
codelion
optillm

Optimizing inference proxy for LLMs

12K 3K 266
ray-project
ant-ray-nightly

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

12K 42K 8K
predibase
lorax-client

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

8K 4K 312
neuralmagic
deepsparse

Sparsity-aware deep learning inference runtime for CPUs

6K 3K 192
vroomfondel
dgxarley

Ansible playbooks for a 3-node K3s cluster with NVIDIA DGX Spark nodes for distributed LLM inference

6K 1 0
intel
intel-extension-for-transformers

Repository of the Intel® Extension for Transformers

6K 2K 217
stratusadv
dandy

Dandy is an intelligence framework for developing programmatic solutions using artificial intelligence.

6K 4 1
    • Data from PyPI, GitHub, ClickHouse, and BigQuery