AI Evaluation Python Packages

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0

uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

6K 1K 121

eval-ai-library

Comprehensive AI Model Evaluation Framework with support for multiple LLM providers

6K 33 3

agent-action-guard

🛡️ Safe AI Agents through Action Classifier

2K 9 6

ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5

ai-stability

Measure LLM output consistency from the command line.

1K 0 0

promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

1K 101 2

evalstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

1K 101 2

judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

973 5 1

chatbot-auditor

Quality auditor for AI chatbots. Analyzes your conversation logs to show where the bot is underperforming.

853 0 0

workflowbench

Lightweight benchmark harness for AI-driven business workflows

685 1 0

gaico

A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.

680 6 2

aisert

Assert-style validation library for AI outputs - ensure your LLMs behave exactly as expected.

237 1 0

rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

181 2 1

factly-eval

CLI tool to evaluate LLM factuality on the MMLU benchmark.

67 2 0

aspire-ai

Adversarial Student-Professor Internalized Reasoning Engine

25 0 0
