PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics.
Results (downloads · stars · forks):
  • confident-ai / deepeval: The LLM Evaluation Framework (3.5M · 15K · 1K)
  • EleutherAI / lm-eval: A framework for few-shot evaluation of language models (1.4M · 12K · 3K)
  • huggingface / lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (22K · 2K · 454)
  • tohtsky / irspack: Train, evaluate, and optimize implicit feedback-based recommender systems (13K · 31 · 10)
  • EuroEval / scandeval: The robust European language model benchmark (13K · 175 · 56)
  • EuroEval / euroeval: The robust European language model benchmark (5K · 175 · 56)
  • letta-ai / letta-evals: Evaluation kit for testing stateful agents (4K · 70 · 9)
  • ServiceNow / agentlab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility (3K · 574 · 112)
  • Kiln-AI / kiln-ai: Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more (3K · 5K · 361)
  • aiverify-foundation / aiverify-moonshot: Moonshot - A simple and modular tool to evaluate and red-team any LLM application (2K · 322 · 61)
  • kaiko-ai / kaiko-eva: Evaluation framework for oncology foundation models (FMs) (2K · 156 · 38)
  • pyrddlgym-project / pyrddlgym: A toolkit for auto-generation of OpenAI Gym environments from RDDL description files (2K · 93 · 23)
  • zeno-ml / zenoml: AI Data Management & Evaluation Platform (2K · 214 · 11)
  • Kiln-AI / kiln-server: Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more (1K · 5K · 361)
  • lapix-ufsc / lapixdl: Python package with Deep Learning utilities for Computer Vision (1K · 9 · 3)
  • b-bayrak / ceval: CEval is a Python package for evaluating the quality of counterfactual explanations produced by any post-hoc XAI (Explainable AI) method. It computes 14 established metrics with a single call and works with diverse model architectures (1K · 0 · 0)
  • jsell-rh / k-eval: Simple context-aware evaluation framework for AI agents using MCP (1K · 2 · 0)
  • ctrl-gaurav / beyondbench: [ICLR 2026 accepted paper] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models (1K · 3 · 0)
  • gmitt98 / fieldtest: LLM evaluation framework - define what correct, well-formed, and safe mean before you measure (1K · 0 · 0)
  • Khanz9664 / trustlens: Open-source Python library for evaluating ML model reliability beyond accuracy, with calibration, failure, and fairness diagnostics for informed deployment decisions (995 · 10 · 12)
  • ankurpand3y / judicator: Who evaluates the evaluator? Judicator audits LLM-as-a-Judge systems for 7 documented bias types. Zero config. Works with any LLM (973 · 5 · 1)
  • vinid / quica: quica is a tool to run inter-coder agreement pipelines in an easy and effective way. Multiple measures are run and results are collected in a single table that can be easily exported to LaTeX (918 · 23 · 0)
  • NOAA-OWP / gval: Flexible, portable, and efficient geospatial evaluations for a variety of data (862 · 25 · 4)
  • 8ddieHu0314 / skill-lab: Agent Skills Evaluation Framework (837 · 51 · 4)

    • Data from PyPI, GitHub, ClickHouse, and BigQuery