PyPI Stats

LLM Evaluation Python Packages

Python packages tagged with the GitHub topic llm-evaluation, sorted by monthly PyPI downloads. Each entry shows owner / package, followed by monthly downloads, GitHub stars, and forks; stars and forks describe the shared source repository, so packages from the same repository show identical counts.
mlflow / mlflow-skinny · 38.3M downloads/month · 26K stars · 6K forks
mlflow / mlflow · 36.4M downloads/month · 26K stars · 6K forks
mlflow / mlflow-tracing · 16.2M downloads/month · 26K stars · 6K forks
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

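As a sketch of the evaluate-and-monitor workflow the description refers to, the snippet below logs a hypothetical LLM evaluation run with MLflow's core tracking API; the run name, parameters, and metric scores are placeholders, and runs land in the local ./mlruns store by default.

```python
import mlflow

# Record one evaluation run; all values here are placeholders.
with mlflow.start_run(run_name="rag-eval-baseline"):
    mlflow.log_param("model", "gpt-4o-mini")     # hypothetical model under test
    mlflow.log_param("retriever_k", 5)           # hypothetical retrieval setting
    mlflow.log_metric("answer_relevancy", 0.87)  # placeholder score
    mlflow.log_metric("p95_latency_s", 1.4)      # placeholder score
```
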
comet-ml / opik · 4.8M downloads/month · 19K stars · 1K forks
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

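A minimal tracing sketch, assuming Opik's documented @track decorator and an already-configured Opik backend (e.g. via `opik configure`); the retriever and LLM call below are stand-ins.

```python
from opik import track

# Each decorated call becomes a trace; nested decorated calls
# show up as nested spans in the Opik UI.
@track
def retrieve_context(question: str) -> str:
    return "placeholder context"  # stand-in for a real retriever

@track
def answer(question: str) -> str:
    context = retrieve_context(question)
    return f"answer grounded in: {context}"  # stand-in for a real LLM call

answer("What does Opik record?")
```
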
confident-ai / deepeval · 3.5M downloads/month · 15K stars · 1K forks
The LLM Evaluation Framework

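deepeval's core pattern is pytest-style test cases scored by LLM-judged metrics; a minimal sketch using its documented LLMTestCase and AnswerRelevancyMetric (the default judge model requires an OpenAI API key, and the strings below are illustrative).

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Run with pytest or `deepeval test run`; the metric is scored by an
# LLM judge, so an evaluation-model API key must be configured.
def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```
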
Arize-ai / arize-phoenix · 2.2M downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-otel · 1.7M downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-client · 895K downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-evals · 770K downloads/month · 10K stars · 850 forks
AI Observability & Evaluation

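A minimal sketch of local use, assuming the launch_app entry point from the main arize-phoenix package; instrumenting an application to send OpenTelemetry traces to it (the job of arize-phoenix-otel) is omitted here.

```python
import phoenix as px  # pip install arize-phoenix

# Start the local Phoenix UI; instrumented LLM apps can send
# OpenTelemetry traces to it for inspection and evaluation.
session = px.launch_app()
print(session.url)  # typically http://localhost:6006
```
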
Microsoft / prompty · 440K downloads/month · 1K stars · 113 forks
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

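A sketch of executing a .prompty asset (YAML front matter plus a prompt template) from Python, following the package's quickstart pattern; the file name, the inputs keyword, and the question field are assumptions, so treat the exact signature as unverified.

```python
import prompty
import prompty.azure  # pulls in the Azure OpenAI invoker

# "basic.prompty" and its "question" input are hypothetical.
response = prompty.execute(
    "basic.prompty",
    inputs={"question": "What goes in a .prompty file?"},
)
print(response)
```
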
JudgmentLabs / judgeval · 349K downloads/month · 1K stars · 91 forks
The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.

truera / trulens-core · 86K downloads/month · 3K stars · 271 forks
Evaluation and Tracking for LLM Experiments and AI Agents

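A minimal sketch of TruLens's session-based recording flow, assuming the namespaced 1.x API (TruSession from trulens-core, run_dashboard from trulens-dashboard); wrapping a real app with a recorder and defining Feedback functions is omitted.

```python
from trulens.core import TruSession          # trulens-core
from trulens.dashboard import run_dashboard  # trulens-dashboard

# Start a local session; traces and feedback (evaluation) results
# are kept in a local SQLite database by default.
session = TruSession()
session.reset_database()  # begin from a clean store

# Serve the leaderboard / trace UI for whatever the session records.
run_dashboard(session)
```
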
NVIDIA / garak · 75K downloads/month · 8K stars · 922 forks
the LLM vulnerability scanner

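garak is driven from the command line rather than a Python API; a minimal sketch using the --list_probes flag from its README to enumerate available vulnerability probes, wrapped in subprocess to keep the example in Python.

```python
import subprocess

# Enumerate garak's vulnerability probes; an actual scan additionally
# needs a target model specified on the command line.
subprocess.run(["python", "-m", "garak", "--list_probes"], check=True)
```
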
comet-ml / opik-optimizer · 60K downloads/month · 19K stars · 1K forks
Same repository and description as opik above.

truera / trulens-otel-semconv · 60K downloads/month · 3K stars · 271 forks
truera / trulens-feedback · 59K downloads/month · 3K stars · 271 forks
truera / trulens-dashboard · 57K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

agenta-ai / agenta · 56K downloads/month · 4K stars · 517 forks
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

truera / trulens-eval · 48K downloads/month · 3K stars · 271 forks
truera / trulens · 44K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

Giskard-AI / giskard · 40K downloads/month · 5K stars · 446 forks
🐢 Open-Source Evaluation & Testing library for LLM Agents

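A sketch of Giskard's automated LLM scan, assuming its documented Model wrapper and scan entry point; the prediction function, name, and description are hypothetical, and the LLM-assisted detectors need an LLM client configured.

```python
import pandas as pd
import giskard

# Hypothetical prediction function: maps a DataFrame of questions to
# answers. In practice this would call your LLM application.
def answer_questions(df: pd.DataFrame) -> list[str]:
    return ["placeholder answer"] * len(df)

# Wrap the function, then run the automated vulnerability scan.
model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="support-bot",                            # hypothetical
    description="Answers customer support questions.",
    feature_names=["question"],
)
report = giskard.scan(model)
report.to_html("scan_report.html")
```
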
truera / trulens-connectors-snowflake · 38K downloads/month · 3K stars · 271 forks
truera / trulens-providers-cortex · 22K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

EvolvingLMMs-Lab / lmms-eval · 17K downloads/month · 4K stars · 578 forks
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

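lmms-eval is likewise CLI-driven, in the style of lm-evaluation-harness; a sketch of a run follows, where the model adapter name, task name, and flags are assumptions to check against `python -m lmms_eval --help`.

```python
import subprocess

# Hypothetical run: evaluate a LLaVA checkpoint on the MME benchmark.
subprocess.run([
    "python", "-m", "lmms_eval",
    "--model", "llava",        # assumed model adapter name
    "--tasks", "mme",          # assumed task name
    "--batch_size", "1",
    "--output_path", "./logs/",
], check=True)
```
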
Data from PyPI, GitHub, ClickHouse, and BigQuery.