PyPI Stats

Evals Python Packages

Python packages tagged with the GitHub topic "evals", sorted by relevance. Each entry lists monthly downloads, GitHub stars, and forks.
pydantic/logfire
    AI observability platform for production LLM and agent systems.
    25M downloads · 4K stars · 230 forks

Arize-ai/arize-phoenix
    AI Observability & Evaluation.
    2.2M downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-otel
    AI Observability & Evaluation.
    1.7M downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-client
    AI Observability & Evaluation.
    898K downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-evals
    AI Observability & Evaluation.
    768K downloads · 10K stars · 850 forks

AgentOps-AI/agentops
    Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
    730K downloads · 6K stars · 571 forks

truera/trulens-core
    Evaluation and Tracking for LLM Experiments and AI Agents.
    86K downloads · 3K stars · 271 forks

truera/trulens-otel-semconv
    Evaluation and Tracking for LLM Experiments and AI Agents.
    59K downloads · 3K stars · 271 forks

truera/trulens-feedback
    Evaluation and Tracking for LLM Experiments and AI Agents.
    59K downloads · 3K stars · 271 forks

truera/trulens-dashboard
    Evaluation and Tracking for LLM Experiments and AI Agents.
    57K downloads · 3K stars · 271 forks

truera/trulens-eval
    Evaluation and Tracking for LLM Experiments and AI Agents.
    48K downloads · 3K stars · 271 forks

truera/trulens
    Evaluation and Tracking for LLM Experiments and AI Agents.
    45K downloads · 3K stars · 271 forks

truera/trulens-connectors-snowflake
    Evaluation and Tracking for LLM Experiments and AI Agents.
    38K downloads · 3K stars · 271 forks

harbor-framework/harbor-rewardkit
    Harbor is a framework for running agent evaluations and for creating and using RL environments.
    38K downloads · 2K stars · 978 forks

manav8498/shadow-diff
    Behavior contracts for AI agents.
    24K downloads · 4 stars · 0 forks

truera/trulens-providers-cortex
    Evaluation and Tracking for LLM Experiments and AI Agents.
    22K downloads · 3K stars · 271 forks

dustalov/evalica
    Evalica, your favourite evaluation toolkit.
    21K downloads · 62 stars · 5 forks

truera/trulens-providers-openai
    Evaluation and Tracking for LLM Experiments and AI Agents.
    14K downloads · 3K stars · 271 forks

truera/trulens-providers-litellm
    Evaluation and Tracking for LLM Experiments and AI Agents.
    12K downloads · 3K stars · 271 forks

ben-ranford/cellin
    Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting.
    11K downloads · 0 stars · 0 forks

truera/trulens-apps-langchain
    Evaluation and Tracking for LLM Experiments and AI Agents.
    10K downloads · 3K stars · 271 forks

blackwell-systems/mcp-assert
    Test your MCP server against the real protocol. Any language, any transport. No mocks, no imports, no language lock-in.
    8K downloads · 4 stars · 1 fork

truera/trulens-apps-llamaindex
    Evaluation and Tracking for LLM Experiments and AI Agents.
    6K downloads · 3K stars · 271 forks

johnnichev/selectools
    Production-ready Python framework for AI agents with built-in guardrails, audit logging, cost tracking, and hybrid RAG. Supports OpenAI, Anthropic, Gemini, Ollama. By NichevLabs.
    6K downloads · 9 stars · 1 fork
Data from PyPI, GitHub, ClickHouse, and BigQuery.