A pip-installable benchmark runner for LLMs and agents. Five minutes to your first eval.
Tune the initial recurrent state of hybrid models. Zero inference overhead.
A strict, auditable HumanEval benchmark runner for GGUF models served via llama.cpp, using its OpenAI-compatible HTTP API.
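Such a runner talks to the llama.cpp server over its OpenAI-compatible endpoints. A minimal sketch of one completion call, assuming a server started with `llama-server -m model.gguf` listening at `localhost:8080` (the URL, stop sequences, and helper names here are illustrative, not the tool's actual API):

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    # Greedy decoding (temperature 0) keeps runs reproducible, which
    # matters for an auditable benchmark.
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        # Typical stop sequences for HumanEval-style function completion.
        "stop": ["\ndef ", "\nclass ", "\nif __name__"],
    }

def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    # POST to llama.cpp's OpenAI-compatible /v1/completions endpoint.
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

if __name__ == "__main__":
    print(complete('def add(a, b):\n    """Return a + b."""\n'))
```

The generated text would then be appended to the prompt and executed against the task's unit tests to score pass/fail.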