Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
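One common statistical method for this kind of comparison is a paired bootstrap confidence interval on the per-example score difference between two prompts. The sketch below is a generic illustration of that technique, not this tool's API; the function name and the example scores are hypothetical.

```python
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap a 95% CI for the mean score difference between two
    prompts evaluated on the same examples (paired design).
    Note: illustrative sketch, not part of any specific library."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    # Resample example indices with replacement, keeping pairs intact.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

# Hypothetical per-example correctness (1 = correct) for two prompts.
prompt_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
prompt_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
mean_diff, (lo, hi) = paired_bootstrap_diff(prompt_a, prompt_b)
# mean_diff is 0.3; if the CI (lo, hi) excludes 0, the difference
# is unlikely to be resampling noise alone.
```

Pairing matters: resampling examples (rather than scores independently) preserves the per-example correlation between the two prompts, which tightens the interval.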
Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.
The prompt engineering, prompt management, and prompt evaluation tool for Python.