PyPI Stats

Evals Python Packages

Python packages tagged with the GitHub topic "evals", sorted by relevance. Each entry lists monthly downloads, GitHub stars, and forks.
pydantic/logfire
    AI observability platform for production LLM and agent systems.
    25M downloads · 4K stars · 230 forks

Arize-ai/arize-phoenix
    AI Observability & Evaluation.
    2.2M downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-otel
    AI Observability & Evaluation.
    1.7M downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-client
    AI Observability & Evaluation.
    898K downloads · 10K stars · 850 forks

Arize-ai/arize-phoenix-evals
    AI Observability & Evaluation.
    768K downloads · 10K stars · 850 forks

AgentOps-AI/agentops
    Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks, including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI.
    730K downloads · 6K stars · 571 forks

truera/trulens-core
    Evaluation and Tracking for LLM Experiments and AI Agents.
    86K downloads · 3K stars · 271 forks

truera/trulens-otel-semconv
    Evaluation and Tracking for LLM Experiments and AI Agents.
    59K downloads · 3K stars · 271 forks

truera/trulens-feedback
    Evaluation and Tracking for LLM Experiments and AI Agents.
    59K downloads · 3K stars · 271 forks

truera/trulens-dashboard
    Evaluation and Tracking for LLM Experiments and AI Agents.
    57K downloads · 3K stars · 271 forks

truera/trulens-eval
    Evaluation and Tracking for LLM Experiments and AI Agents.
    48K downloads · 3K stars · 271 forks

truera/trulens
    Evaluation and Tracking for LLM Experiments and AI Agents.
    45K downloads · 3K stars · 271 forks

truera/trulens-connectors-snowflake
    Evaluation and Tracking for LLM Experiments and AI Agents.
    38K downloads · 3K stars · 271 forks

harbor-framework/harbor-rewardkit
    Harbor is a framework for running agent evaluations and for creating and using RL environments.
    38K downloads · 2K stars · 978 forks

manav8498/shadow-diff
    Behavior contracts for AI agents.
    24K downloads · 4 stars · 0 forks

truera/trulens-providers-cortex
    Evaluation and Tracking for LLM Experiments and AI Agents.
    22K downloads · 3K stars · 271 forks

dustalov/evalica
    Evalica, your favourite evaluation toolkit.
    21K downloads · 62 stars · 5 forks

truera/trulens-providers-openai
    Evaluation and Tracking for LLM Experiments and AI Agents.
    14K downloads · 3K stars · 271 forks

truera/trulens-providers-litellm
    Evaluation and Tracking for LLM Experiments and AI Agents.
    12K downloads · 3K stars · 271 forks

ben-ranford/cellin
    Build long-lived multimodal memory, dream over it, and retrieve context with transparent weighting.
    11K downloads · 0 stars · 0 forks

truera/trulens-apps-langchain
    Evaluation and Tracking for LLM Experiments and AI Agents.
    10K downloads · 3K stars · 271 forks

blackwell-systems/mcp-assert
    Test your MCP server against the real protocol. Any language, any transport. No mocks, no imports, no language lock-in.
    8K downloads · 4 stars · 1 fork

truera/trulens-apps-llamaindex
    Evaluation and Tracking for LLM Experiments and AI Agents.
    6K downloads · 3K stars · 271 forks

johnnichev/selectools
    Production-ready Python framework for AI agents with built-in guardrails, audit logging, cost tracking, and hybrid RAG. Supports OpenAI, Anthropic, Gemini, Ollama. By NichevLabs.
    6K downloads · 9 stars · 1 fork
Data from PyPI, GitHub, ClickHouse, and BigQuery.