PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics.
Results (downloads · stars · forks):
  • confident-ai / deepeval: The LLM Evaluation Framework (3.5M · 15K · 1K)
  • EleutherAI / lm-eval: A framework for few-shot evaluation of language models (1.4M · 12K · 3K)
  • huggingface / lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (22K · 2K · 454)
  • tohtsky / irspack: Train, evaluate, and optimize implicit feedback-based recommender systems (13K · 31 · 10)
  • EuroEval / scandeval: The robust European language model benchmark (13K · 175 · 56)
  • EuroEval / euroeval: The robust European language model benchmark (5K · 175 · 56)
  • letta-ai / letta-evals: Evaluation kit for testing stateful agents (4K · 70 · 9)
  • ServiceNow / agentlab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility (3K · 574 · 112)
  • Kiln-AI / kiln-ai: Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more (3K · 5K · 361)
  • aiverify-foundation / aiverify-moonshot: Moonshot - A simple and modular tool to evaluate and red-team any LLM application (2K · 322 · 61)
  • kaiko-ai / kaiko-eva: Evaluation framework for oncology foundation models (FMs) (2K · 156 · 38)
  • pyrddlgym-project / pyrddlgym: A toolkit for auto-generation of OpenAI Gym environments from RDDL description files (2K · 93 · 23)
  • zeno-ml / zenoml: AI Data Management & Evaluation Platform (2K · 214 · 11)
  • Kiln-AI / kiln-server: Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more (1K · 5K · 361)
  • lapix-ufsc / lapixdl: Python package with Deep Learning utilities for Computer Vision (1K · 9 · 3)
  • b-bayrak / ceval: CEval is a Python package for evaluating the quality of counterfactual explanations produced by any post-hoc XAI (Explainable AI) method. It computes 14 established metrics with a single call and works with diverse model architectures (1K · 0 · 0)
  • jsell-rh / k-eval: Simple context-aware evaluation framework for AI agents using MCP (1K · 2 · 0)
  • ctrl-gaurav / beyondbench: [ICLR 2026 accepted paper] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models (1K · 3 · 0)
  • gmitt98 / fieldtest: LLM evaluation framework - define what correct, well-formed, and safe mean before you measure (1K · 0 · 0)
  • Khanz9664 / trustlens: Open-source Python library for evaluating ML model reliability beyond accuracy, with calibration, failure, and fairness diagnostics for informed deployment decisions (995 · 10 · 12)
  • ankurpand3y / judicator: Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM (973 · 5 · 1)
  • vinid / quica: quica is a tool to run inter-coder agreement pipelines in an easy and effective way. Multiple measures are run and results are collected in a single table that can be easily exported to LaTeX (918 · 23 · 0)
  • NOAA-OWP / gval: Flexible, portable, and efficient geospatial evaluations for a variety of data (862 · 25 · 4)
  • 8ddieHu0314 / skill-lab: Agent Skills Evaluation Framework (837 · 51 · 4)

    • Data from PyPI, GitHub, ClickHouse, and BigQuery