Agent Benchmark Python Packages

evalview

Open-source testing and regression detection framework for AI agents. Offers golden baseline diffing and CI/CD integration, and works with LangGraph, CrewAI, OpenAI, Anthropic Claude, HuggingFace, Ollama, and MCP.

3K 95 21

tracecore

A lightweight benchmark for action-oriented agents.

1K 8 0

codejoust

A CLI arena for AI coding agents. Throw one bug at Claude Code, Codex, and aider; let them race, auto-score the results, and pick the winner.

763 3 0
