PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter them by metrics
confident-ai
deepeval

The LLM Evaluation Framework

3.5M downloads · 15K stars · 1K forks
AgentOps-AI
agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

718K downloads · 6K stars · 571 forks
MiXaiLL76
faster-coco-eval

Continuation of the abandoned fast-coco-eval project

534K downloads · 141 stars · 11 forks
Unbabel
unbabel-comet

A Neural Framework for MT Evaluation

272K downloads · 743 stars · 108 forks
AmenRa
ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

84K downloads · 674 stars · 31 forks
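ranx's fusion side combines several ranked retrieval runs into a single ranking. A standard technique in this space is Reciprocal Rank Fusion (RRF), which scores each document by the sum of 1/(k + rank) over the runs it appears in. The sketch below is a plain-Python illustration of that formula, not ranx's API; the runs and document IDs are hypothetical, and k=60 is the constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.
    Each doc scores the sum of 1/(k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical retrieval runs for the same query:
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d1"]
print(rrf_fuse([run_a, run_b]))  # d2 wins: ranked high in both runs
```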
MantisAI
nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

35K downloads · 213 stars · 27 forks
ibm
unitxt

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

33K downloads · 212 stars · 67 forks
songweige
cd-fvd

[CVPR 2024] On the Content Bias in Fréchet Video Distance

27K downloads · 146 stars · 8 forks
huggingface
lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

22K downloads · 2K stars · 454 forks
fakufaku
fast-bss-eval

A fast implementation of bss_eval metrics for blind source separation

21K downloads · 146 stars · 9 forks
thieu1995
permetrics

Artificial intelligence (AI, ML, DL) performance metrics implemented in Python

16K downloads · 91 stars · 22 forks
google-research
rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

15K downloads · 872 stars · 49 forks
noutenki
pyrouge

A Python wrapper for the ROUGE summarization evaluation package

12K downloads · 249 stars · 72 forks
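ROUGE scores a candidate summary by its n-gram overlap with a reference summary. The sketch below is a minimal plain-Python illustration of ROUGE-N recall (the fraction of reference n-grams also present in the candidate); it is illustrative only — pyrouge wraps the original ROUGE package rather than reimplementing it, and the example sentences are made up.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

# 5 of the 6 reference unigrams appear in the candidate:
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat"))
```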
k4black
codebleu

Pip-compatible CodeBLEU metric implementation available for Linux/macOS/Windows

7K downloads · 133 stars · 29 forks
kqf
ir-metrics

The most common information retrieval (IR) metrics

4K downloads · 5 stars · 0 forks
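Among the most common IR metrics is Mean Reciprocal Rank (MRR): for each query, take the reciprocal of the rank of the first relevant result, then average over queries. The sketch below is a plain-Python illustration of what MRR computes, not ir-metrics' API; the queries and relevance judgments are hypothetical.

```python
def mean_reciprocal_rank(results, relevant):
    """results: one ranked list of doc IDs per query.
    relevant: one set of relevant doc IDs per query."""
    total = 0.0
    for ranking, rel in zip(results, relevant):
        for rank, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break                # queries with no hit contribute 0
    return total / len(results)

# Two hypothetical queries: first relevant hit at rank 1, then at rank 3.
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], [{"a"}, {"z"}]))
```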
proycon
pynlpl

PyNLPl, pronounced 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as extracting n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms, as well as parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL) and clients for interfacing with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

4K downloads · 476 stars · 66 forks
clovaai
prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

3K downloads · 270 stars · 28 forks
GiulioRossetti
nf1

A novel approach to evaluating community detection algorithms against ground truth

3K downloads · 20 stars · 9 forks
erdogant
classeval

Evaluation of supervised predictions for two-class and multi-class classifiers

3K downloads · 8 stars · 2 forks
MIND-LAB
octis

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

2K downloads · 800 stars · 118 forks
shi-ang
survivaleval

The most comprehensive Python package for evaluating survival analysis models.

2K downloads · 50 stars · 7 forks
broundal
pytolemaic

Toolbox for analyzing a model's quality and generating model descriptions

2K downloads · 10 stars · 3 forks
jsell-rh
k-eval

Simple context-aware evaluation framework for AI agents using MCP.

1K downloads · 2 stars · 0 forks
evaluation-context-protocol
ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K downloads · 8 stars · 1 fork
    • Data from PyPI, GitHub, ClickHouse, and BigQuery