PyPI Stats

Search Packages

Find Python packages by name, description, GitHub topic, or filter by metrics
Basaltlabs-app
gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0
cvs-health
uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

6K 1K 121
meshkovQA
eval-ai-library

Comprehensive AI Model Evaluation Framework with support for multiple LLM providers

6K 33 3
Pro-GenAI
agent-action-guard

🛡️ Safe AI Agents through Action Classifier

2K 9 6
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
buildwithabid
ai-stability

Measure LLM output consistency from the command line.

1K 0 0
ianarawjo
promptstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

1K 101 2
ianarawjo
evalstats

Statistical analysis methods for comparing prompt and model performance in LLM evaluations.

1K 101 2
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

973 5 1
HemantBK
chatbot-auditor

Quality auditor for AI chatbots. Analyzes your conversation logs to show where the bot is underperforming.

853 0 0
thegeekajay
workflowbench

Lightweight benchmark harness for AI-driven business workflows

685 1 0
ai4society
gaico

A Python library providing evaluation metrics to compare generated texts from LLMs, often against reference texts. Features streamlined workflows for model comparison and visualization.

680 6 2
haipad
aisert

Assert-style validation library for AI outputs - ensure your LLMs behave exactly as expected.

237 1 0
RAILethicsHub
rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

181 2 1
sergeyklay
factly-eval

CLI tool to evaluate LLM factuality on the MMLU benchmark.

67 2 0
mcp-tool-shop-org
aspire-ai

Adversarial Student-Professor Internalized Reasoning Engine

25 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery