Model Evaluation Python Packages

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0

debiai-gui

Bias detection and contextual evaluation tool for your AI projects

3K 30 5

starwhale-bootstrap

An MLOps/LLMOps platform

2K 238 39

starwhale

An MLOps/LLMOps platform

2K 238 39

pyaerocom

Python tools for climate and air quality model evaluation

2K 31 15

fovux-mcp

Local-first YOLO workbench for edge-AI computer vision: MCP server + VS Code Studio for dataset inspection, training, evaluation, export, and RTSP inference.

2K 1 0

modeldiffx

Model behavioral diffing - compare LLM outputs across versions, detect regressions.

1K 1 0

ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1

ecp-sdk

1K 8 1

trustlens

Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.

995 10 12

judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

973 5 1

insurance-cv

Temporal and distributional cross-validation, and feature screening, for insurance pricing models

539 0 0

easymlselector

A model selection process for machine learning tasks that runs on a subset of the training sample

499 0 0

machlearn

A Simple Yet Powerful Machine Learning Python Library

400 1 0

evalcards

Python library for generating evaluation reports (classification, regression, forecasting) with ready-made metrics and charts in Markdown, JSON, and soon HTML.

342 1 0

titus2

Titus 2: Portable Format for Analytics (PFA) implementation for Python 3.4+

323 24 2

metriculous

Measure and visualize machine learning model performance without the usual boilerplate.

231 98 11

skrobot

skrobot is a Python module for designing, running, and tracking machine learning experiments and tasks. It is built on top of the scikit-learn framework.

218 24 2

scorecardbundle

A High-level Scorecard Modeling API | Scorecard modeling, all in one place

217 83 30

llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

215 0 0

data-science-snippets

A modular set of data science utilities for EDA, cleaning, and more.

190 3 2

luna-ml

Luna ML - ML Leaderboard for your team with automatic model evaluation

189 4 1

utils-axn-2237

🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.

179 3 2

smartpredict

An advanced machine learning library designed to simplify model training, evaluation, and selection.

172 0 0
