PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics.
| Org | Package | Description | Downloads | Stars | Forks |
|---|---|---|---|---|---|
| ray-project | ray | Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads. | 52.7M | 42K | 8K |
| vllm-project | vllm | A high-throughput and memory-efficient inference and serving engine for LLMs. | 9.4M | 79K | 16K |
| skypilot-org | skypilot | Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access and manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem). | 1.8M | 10K | 1K |
| skypilot-org | skypilot-nightly | Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access and manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem). | 482K | 10K | 1K |
| bentoml | bentoml | The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more. | 198K | 9K | 959 |
| vllm-project | vllm-tpu | A high-throughput and memory-efficient inference and serving engine for LLMs. | 143K | 79K | 16K |
| ray-project | ant-ray-cpp-nightly | Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads. | 49K | 42K | 8K |
| ray-project | ray-cpp | Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads. | 39K | 42K | 8K |
| mosecorg | mosec | A high-performance ML model serving framework, offering dynamic batching and CPU/GPU pipelines to fully exploit your compute resources. | 18K | 899 | 72 |
| bentoml | openllm | Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud. | 18K | 12K | 807 |
| skypilot-org | trainy-skypilot-nightly | Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access and manage all AI compute (Kubernetes, Slurm, 20+ clouds, on-prem). | 17K | 10K | 1K |
| NVIDIA | tensorrt-llm | TensorRT LLM provides an easy-to-use Python API to define large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also contains components to create Python and C++ runtimes that orchestrate inference execution performantly. | 16K | 14K | 2K |
| ray-project | ant-ray-nightly | Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads. | 12K | 42K | 8K |
| predibase | lorax-client | Multi-LoRA inference server that scales to thousands of fine-tuned LLMs. | 8K | 4K | 312 |
| vllm-project | vllm-ascend | Community-maintained hardware plugin for vLLM on Ascend. | 7K | 2K | 1K |
| ray-project | ant-ray | Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI libraries for accelerating ML workloads. | 6K | 42K | 8K |
| bentoml | openllm-core | Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud. | 5K | 12K | 807 |
| friendliai | friendli-client | [⛔️ DEPRECATED] Friendli: the fastest serving engine for generative AI. | 5K | 50 | 7 |
| bentoml | openllm-client | Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud. | 4K | 12K | 807 |
| superduper-io | superduper-openai | Superduper allows users to work with OpenAI API models. | 3K | 5K | 538 |
| PaddlePaddle | fastdeploy-python | Deployment toolkit for deep learning models. | 2K | 4K | 744 |
| superduper-io | superduper-framework | Superduper: end-to-end framework for building custom AI applications and agents. | 2K | 5K | 538 |
| bentoml | bentoml-unsloth | The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more. | 2K | 9K | 959 |
| unaidedelf8777 | faster-outlines | Faster, lazy backend for the `Outlines` library. | 1K | 5 | 0 |
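As a hedged sketch of where figures like those above could come from: the public pypistats.org JSON API (`/api/packages/<name>/recent`) serves per-package download counts, which is an assumption about this site's data pipeline rather than a documented fact. The `abbreviate` helper below is hypothetical, added only to mimic the compact `52.7M`-style formatting shown in the results.

```python
import json
import urllib.request

# Assumption: the public pypistats.org JSON API; fetching requires network access.
PYPISTATS_RECENT = "https://pypistats.org/api/packages/{pkg}/recent"


def fetch_recent_downloads(pkg: str) -> dict:
    """Return recent download counts for a package,
    e.g. {"last_day": ..., "last_week": ..., "last_month": ...}."""
    with urllib.request.urlopen(PYPISTATS_RECENT.format(pkg=pkg)) as resp:
        return json.load(resp)["data"]


def abbreviate(n: int) -> str:
    """Hypothetical helper: render a count compactly, e.g. 52_700_000 -> '52.7M'."""
    for threshold, suffix in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            return f"{n / threshold:.1f}{suffix}".replace(".0", "")
    return str(n)


# Offline example: format the counts shown for ray-project/ray.
print(abbreviate(52_700_000), abbreviate(42_000), abbreviate(8_000))  # → 52.7M 42K 8K
```

Stars and forks would come from a separate source (e.g. the GitHub API's `stargazers_count` and `forks_count` repository fields), which matches the footer's credit to both PyPI and GitHub.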
Data from PyPI, GitHub, ClickHouse, and BigQuery.