A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.
Evaluate LLMs on 10 benchmarks with 132K+ questions. Supports 8 providers: OpenAI, Anthropic, Groq, Together, Fireworks, DeepSeek, Ollama, HuggingFace. Unified CLI and web dashboard.
CLI tool to evaluate LLM factuality on the MMLU benchmark.