MMLU Python Packages

litebench

A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.

1K 0 0

llm-benchmark-toolkit

Benchmark LLMs with 10 benchmarks & 132K+ questions. 8 providers: OpenAI, Anthropic, Groq, Together, Fireworks, DeepSeek, Ollama, HuggingFace. Unified CLI + Web dashboard.

842 1 1

factly-eval

CLI tool to evaluate LLM factuality on the MMLU benchmark.

72 2 0
