Model Evaluation Python Packages

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0

debiai-gui

Bias detection and contextual evaluation tool for your AI projects

3K 30 5

starwhale-bootstrap

An MLOps/LLMOps platform

2K 238 39

pyaerocom

Python tools for climate and air quality model evaluation

2K 31 15

fovux-mcp

Local-first YOLO workbench for edge-AI computer vision: MCP server + VS Code Studio for dataset inspection, training, evaluation, export, and RTSP inference.

2K 1 0

starwhale

An MLOps/LLMOps platform

2K 238 39

modeldiffx

Model behavioral diffing - compare LLM outputs across versions, detect regressions.

1K 1 0

trustlens

Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.

1K 10 12

ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI Agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1

ecp-sdk

1K 8 1

judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 5 1

insurance-cv

Temporal and distributional cross-validation, and feature screening, for insurance pricing models

533 0 0

easymlselector

A model selection process for Machine Learning tasks on a subset of the training sample

465 0 0

machlearn

A Simple Yet Powerful Machine Learning Python Library

405 1 0

titus2

Titus 2: Portable Format for Analytics (PFA) implementation for Python 3.4+

327 24 2

evalcards

Python library for generating evaluation reports (classification, regression, forecasting) with ready-to-use metrics and charts in Markdown, JSON, and soon HTML.

312 1 0

llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

275 0 0

metriculous

Measure and visualize machine learning model performance without the usual boilerplate.

272 98 11

scorecardbundle

A High-level Scorecard Modeling API | Everything for scorecard modeling in one place

229 83 30

utils-axn-2237

🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.

211 3 2

data-science-snippets

A modular set of data science utilities for EDA, cleaning, and more.

206 3 2

skrobot

skrobot is a Python module for designing, running, and tracking Machine Learning experiments and tasks. It is built on top of the scikit-learn framework.

204 24 2

luna-ml

Luna ML - ML Leaderboard for your team with automatic model evaluation

199 4 1

diagnost

A diagnostics toolkit to help data scientists trust their models, not just train them.

173 0 0