PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter them by metrics
confident-ai
deepeval

The LLM Evaluation Framework

3.5M downloads · 15K stars · 1K forks
AgentOps-AI
agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

718K downloads · 6K stars · 571 forks
MiXaiLL76
faster-coco-eval

Continuation of the abandoned fast-coco-eval project

534K downloads · 141 stars · 11 forks
Unbabel
unbabel-comet

A Neural Framework for MT Evaluation

272K downloads · 743 stars · 108 forks
AmenRa
ranx

⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

84K downloads · 674 stars · 31 forks
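ranx's fusion side combines several ranked retrieval runs into a single ranking. A standard technique in this space is Reciprocal Rank Fusion (RRF), which scores each document by the sum of 1/(k + rank) over the runs it appears in. The sketch below is a plain-Python illustration of that formula, not ranx's API; the runs and document IDs are hypothetical, and k=60 is the constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.
    Each doc scores the sum of 1/(k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical retrieval runs for the same query:
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d1"]
print(rrf_fuse([run_a, run_b]))  # d2 wins: ranked high in both runs
```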
MantisAI
nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

35K downloads · 213 stars · 27 forks
ibm
unitxt

🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

33K downloads · 212 stars · 67 forks
songweige
cd-fvd

[CVPR 2024] On the Content Bias in Fréchet Video Distance

27K downloads · 146 stars · 8 forks
huggingface
lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

22K downloads · 2K stars · 454 forks
fakufaku
fast-bss-eval

A fast implementation of bss_eval metrics for blind source separation

21K downloads · 146 stars · 9 forks
thieu1995
permetrics

Artificial intelligence (AI, ML, DL) performance metrics implemented in Python

16K downloads · 91 stars · 22 forks
google-research
rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

15K downloads · 872 stars · 49 forks
noutenki
pyrouge

A Python wrapper for the ROUGE summarization evaluation package

12K downloads · 249 stars · 72 forks
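ROUGE scores a candidate summary by its n-gram overlap with a reference summary. The sketch below is a minimal plain-Python illustration of ROUGE-N recall (the fraction of reference n-grams also present in the candidate); it is illustrative only — pyrouge wraps the original ROUGE package rather than reimplementing it, and the example sentences are made up.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

# 5 of the 6 reference unigrams appear in the candidate:
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat"))
```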
k4black
codebleu

Pip-compatible CodeBLEU metric implementation available for Linux/macOS/Windows

7K downloads · 133 stars · 29 forks
kqf
ir-metrics

The most common information retrieval (IR) metrics

4K downloads · 5 stars · 0 forks
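Among the most common IR metrics is Mean Reciprocal Rank (MRR): for each query, take the reciprocal of the rank of the first relevant result, then average over queries. The sketch below is a plain-Python illustration of what MRR computes, not ir-metrics' API; the queries and relevance judgments are hypothetical.

```python
def mean_reciprocal_rank(results, relevant):
    """results: one ranked list of doc IDs per query.
    relevant: one set of relevant doc IDs per query."""
    total = 0.0
    for ranking, rel in zip(results, relevant):
        for rank, doc in enumerate(ranking, start=1):
            if doc in rel:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break                # queries with no hit contribute 0
    return total / len(results)

# Two hypothetical queries: first relevant hit at rank 1, then at rank 3.
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], [{"a"}, {"z"}]))
```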
proycon
pynlpl

PyNLPl, pronounced 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as extracting n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms, as well as parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL) and clients for interfacing with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

4K downloads · 476 stars · 66 forks
clovaai
prdc

Code base for the precision, recall, density, and coverage metrics for generative models. ICML 2020.

3K downloads · 270 stars · 28 forks
GiulioRossetti
nf1

A novel approach to evaluating community detection algorithms against ground truth

3K downloads · 20 stars · 9 forks
erdogant
classeval

Evaluation of supervised predictions for two-class and multi-class classifiers

3K downloads · 8 stars · 2 forks
MIND-LAB
octis

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

2K downloads · 800 stars · 118 forks
shi-ang
survivaleval

The most comprehensive Python package for evaluating survival analysis models.

2K downloads · 50 stars · 7 forks
broundal
pytolemaic

Toolbox for analyzing a model's quality and generating model descriptions

2K downloads · 10 stars · 3 forks
jsell-rh
k-eval

Simple context-aware evaluation framework for AI agents using MCP.

1K downloads · 2 stars · 0 forks
evaluation-context-protocol
ecp-runtime

ECP is a standardized interface for orchestrating, auditing, and enforcing authority limits in AI agent evaluations. It moves evaluation from "brittle Python scripts" to a deterministic infrastructure protocol.

1K downloads · 8 stars · 1 fork
    • Data from PyPI, GitHub, ClickHouse, and BigQuery