PyPI Stats

Model Evaluation Python Packages

Python packages with the GitHub topic model-evaluation. Sorted by relevance, with stars and monthly downloads.
Basaltlabs-app
gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0
debiai
debiai-gui

Bias detection and contextual evaluation tool for your AI projects

3K 30 5
star-whale
starwhale-bootstrap

An MLOps/LLMOps platform

2K 238 39
metno
pyaerocom

Python tools for climate and air quality model evaluation

2K 31 15
oaslananka
fovux-mcp

Local-first YOLO workbench for edge-AI computer vision: MCP server + VS Code Studio for dataset inspection, training, evaluation, export, and RTSP inference.

2K 1 0
star-whale
starwhale

An MLOps/LLMOps platform

2K 238 39
stef41
modeldiffx

Model behavioral diffing - compare LLM outputs across versions, detect regressions.

1K 1 0
Khanz9664
trustlens

Open-source Python library for evaluating ML model reliability beyond accuracy — with calibration, failure, and fairness diagnostics for informed deployment decisions.

1K 10 12
evaluation-context-protocol
ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1
evaluation-context-protocol
ecp-sdk

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K 8 1
ankurpand3y
judicator

Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM.

1K 5 1
burning-cost
insurance-cv

Temporal and distributional cross-validation, and feature screening, for insurance pricing models

533 0 0
wmjg-alt
easymlselector

A model selection process for machine learning tasks, run on a subset of the training sample

465 0 0
daniel-yj-yang
machlearn

A Simple Yet Powerful Machine Learning Python Library

405 1 0
animator
titus2

Titus 2: Portable Format for Analytics (PFA) implementation for Python 3.4+

327 24 2
Ricardouchub
evalcards

Python library for generating evaluation reports (classification, regression, forecasting) with metrics and charts ready in Markdown, JSON, and soon HTML.

312 1 0
Rowusuduah
llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

275 0 0
metriculous-ml
metriculous

Measure and visualize machine learning model performance without the usual boilerplate.

272 98 11
Lantianzz
scorecardbundle

A High-level Scorecard Modeling API | Everything for scorecard modeling, all in one place

229 83 30
vAndrewKarma
utils-axn-2237

🧰 Essential EDA and Data Cleaning Helpers for Any DataFrame. This collection of functions is designed to accelerate exploratory data analysis (EDA), quickly surface data quality issues, and offer high-level insights into the structure and content of your dataset.

211 3 2
vAndrewKarma
data-science-snippets

A modular set of data science utilities for EDA, cleaning, and more.

206 3 2
medoidai
skrobot

skrobot is a Python module for designing, running, and tracking machine learning experiments and tasks. It is built on top of the scikit-learn framework.

204 24 2
luna-ml
luna-ml

Luna ML - ML Leaderboard for your team with automatic model evaluation

199 4 1
Eklavya20
diagnost

A diagnostics toolkit to help data scientists trust their models, not just train them.

173 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery