PyPI Stats

LLM Evaluation Python Packages

Python packages tagged with the GitHub topic llm-evaluation, sorted by monthly PyPI downloads. Each entry shows owner / package, followed by monthly downloads, GitHub stars, and forks; stars and forks describe the shared source repository, so packages from the same repository show identical counts.
mlflow / mlflow-skinny · 38.3M downloads/month · 26K stars · 6K forks
mlflow / mlflow · 36.4M downloads/month · 26K stars · 6K forks
mlflow / mlflow-tracing · 16.2M downloads/month · 26K stars · 6K forks
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

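As a sketch of the evaluate-and-monitor workflow the description refers to, the snippet below logs a hypothetical LLM evaluation run with MLflow's core tracking API; the run name, parameters, and metric scores are placeholders, and runs land in the local ./mlruns store by default.

```python
import mlflow

# Record one evaluation run; all values here are placeholders.
with mlflow.start_run(run_name="rag-eval-baseline"):
    mlflow.log_param("model", "gpt-4o-mini")     # hypothetical model under test
    mlflow.log_param("retriever_k", 5)           # hypothetical retrieval setting
    mlflow.log_metric("answer_relevancy", 0.87)  # placeholder score
    mlflow.log_metric("p95_latency_s", 1.4)      # placeholder score
```
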
comet-ml / opik · 4.8M downloads/month · 19K stars · 1K forks
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

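A minimal tracing sketch, assuming Opik's documented @track decorator and an already-configured Opik backend (e.g. via `opik configure`); the retriever and LLM call below are stand-ins.

```python
from opik import track

# Each decorated call becomes a trace; nested decorated calls
# show up as nested spans in the Opik UI.
@track
def retrieve_context(question: str) -> str:
    return "placeholder context"  # stand-in for a real retriever

@track
def answer(question: str) -> str:
    context = retrieve_context(question)
    return f"answer grounded in: {context}"  # stand-in for a real LLM call

answer("What does Opik record?")
```
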
confident-ai / deepeval · 3.5M downloads/month · 15K stars · 1K forks
The LLM Evaluation Framework

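deepeval's core pattern is pytest-style test cases scored by LLM-judged metrics; a minimal sketch using its documented LLMTestCase and AnswerRelevancyMetric (the default judge model requires an OpenAI API key, and the strings below are illustrative).

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Run with pytest or `deepeval test run`; the metric is scored by an
# LLM judge, so an evaluation-model API key must be configured.
def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
    )
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```
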
Arize-ai / arize-phoenix · 2.2M downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-otel · 1.7M downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-client · 895K downloads/month · 10K stars · 850 forks
Arize-ai / arize-phoenix-evals · 770K downloads/month · 10K stars · 850 forks
AI Observability & Evaluation

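A minimal sketch of local use, assuming the launch_app entry point from the main arize-phoenix package; instrumenting an application to send OpenTelemetry traces to it (the job of arize-phoenix-otel) is omitted here.

```python
import phoenix as px  # pip install arize-phoenix

# Start the local Phoenix UI; instrumented LLM apps can send
# OpenTelemetry traces to it for inspection and evaluation.
session = px.launch_app()
print(session.url)  # typically http://localhost:6006
```
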
Microsoft / prompty · 440K downloads/month · 1K stars · 113 forks
Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.

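A sketch of executing a .prompty asset (YAML front matter plus a prompt template) from Python, following the package's quickstart pattern; the file name, the inputs keyword, and the question field are assumptions, so treat the exact signature as unverified.

```python
import prompty
import prompty.azure  # pulls in the Azure OpenAI invoker

# "basic.prompty" and its "question" input are hypothetical.
response = prompty.execute(
    "basic.prompty",
    inputs={"question": "What goes in a .prompty file?"},
)
print(response)
```
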
JudgmentLabs / judgeval · 349K downloads/month · 1K stars · 91 forks
The open source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.

truera / trulens-core · 86K downloads/month · 3K stars · 271 forks
Evaluation and Tracking for LLM Experiments and AI Agents

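A minimal sketch of TruLens's session-based recording flow, assuming the namespaced 1.x API (TruSession from trulens-core, run_dashboard from trulens-dashboard); wrapping a real app with a recorder and defining Feedback functions is omitted.

```python
from trulens.core import TruSession          # trulens-core
from trulens.dashboard import run_dashboard  # trulens-dashboard

# Start a local session; traces and feedback (evaluation) results
# are kept in a local SQLite database by default.
session = TruSession()
session.reset_database()  # begin from a clean store

# Serve the leaderboard / trace UI for whatever the session records.
run_dashboard(session)
```
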
NVIDIA / garak · 75K downloads/month · 8K stars · 922 forks
the LLM vulnerability scanner

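garak is driven from the command line rather than a Python API; a minimal sketch using the --list_probes flag from its README to enumerate available vulnerability probes, wrapped in subprocess to keep the example in Python.

```python
import subprocess

# Enumerate garak's vulnerability probes; an actual scan additionally
# needs a target model specified on the command line.
subprocess.run(["python", "-m", "garak", "--list_probes"], check=True)
```
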
comet-ml / opik-optimizer · 60K downloads/month · 19K stars · 1K forks
Same repository and description as opik above.

truera / trulens-otel-semconv · 60K downloads/month · 3K stars · 271 forks
truera / trulens-feedback · 59K downloads/month · 3K stars · 271 forks
truera / trulens-dashboard · 57K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

agenta-ai / agenta · 56K downloads/month · 4K stars · 517 forks
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

truera / trulens-eval · 48K downloads/month · 3K stars · 271 forks
truera / trulens · 44K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

Giskard-AI / giskard · 40K downloads/month · 5K stars · 446 forks
🐢 Open-Source Evaluation & Testing library for LLM Agents

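A sketch of Giskard's automated LLM scan, assuming its documented Model wrapper and scan entry point; the prediction function, name, and description are hypothetical, and the LLM-assisted detectors need an LLM client configured.

```python
import pandas as pd
import giskard

# Hypothetical prediction function: maps a DataFrame of questions to
# answers. In practice this would call your LLM application.
def answer_questions(df: pd.DataFrame) -> list[str]:
    return ["placeholder answer"] * len(df)

# Wrap the function, then run the automated vulnerability scan.
model = giskard.Model(
    model=answer_questions,
    model_type="text_generation",
    name="support-bot",                            # hypothetical
    description="Answers customer support questions.",
    feature_names=["question"],
)
report = giskard.scan(model)
report.to_html("scan_report.html")
```
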
truera / trulens-connectors-snowflake · 38K downloads/month · 3K stars · 271 forks
truera / trulens-providers-cortex · 22K downloads/month · 3K stars · 271 forks
Same repository and description as trulens-core above.

EvolvingLMMs-Lab / lmms-eval · 17K downloads/month · 4K stars · 578 forks
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

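lmms-eval is likewise CLI-driven, in the style of lm-evaluation-harness; a sketch of a run follows, where the model adapter name, task name, and flags are assumptions to check against `python -m lmms_eval --help`.

```python
import subprocess

# Hypothetical run: evaluate a LLaVA checkpoint on the MME benchmark.
subprocess.run([
    "python", "-m", "lmms_eval",
    "--model", "llava",        # assumed model adapter name
    "--tasks", "mme",          # assumed task name
    "--batch_size", "1",
    "--output_path", "./logs/",
], check=True)
```
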
Data from PyPI, GitHub, ClickHouse, and BigQuery.