LLM-as-a-Judge Python Packages

agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

57K 4K 517

llm-council-core

Multi-LLM council system with peer review and synthesis

3K 18 7

ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5

dingo-python

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

2K 691 71

verdict

Inference-time scaling for LLMs-as-a-judge.

2K 339 25

scorable

The Python SDK for the Scorable API

1K 14 1

judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

973 5 1

mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

879 17 9

root-signals

Scorable SDK

777 14 1

openevalkit

Production-grade Python framework for evaluating LLM and agentic systems with traditional scorers, LLM judges (OpenAI, Anthropic, Ollama, 100+ models via LiteLLM), ensemble aggregation, and smart caching for cost-effective testing.

766 3 0

pytest-llm-rubric

Pytest plugin for semantic PASS/FAIL checks using LLM-as-a-Judge

672 0 0

veritail

Ecommerce search relevance evaluation tool

541 5 1

vllm-judge

A tiny, lightweight library for LLM-as-a-Judge evaluations on vLLM-hosted models.

432 2 2

docling-sdg

A set of tools to create synthetically-generated data from documents

417 45 17

xfinder

[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

291 181 7

artemis-agents

Production-ready multi-agent debate framework with adaptive evaluation and safety monitoring

239 0 0

llm-summary

Use an LLM to summarize paragraphs

205 0 0

root-signals-cli

CLI for the Root Signals API

144 14 1

scorable-cli

Scorable SDK

138 14 1

iflow-mcp-hepivax-mcp-as-a-judge

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations

101 17 9

antibodies-rafaelsandroni

Antibodies for LLM hallucinations

97 0 0

llm-antibodies

Antibodies for LLM hallucinations (grouping LLM-as-a-judge, NLI, and reward models)

92 0 0

dingo-client

Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool

49 693 71
