PyPI Stats

Evaluation Python Packages

Python packages tagged with the GitHub topic "evaluation", sorted by relevance. Each entry lists the repository owner, the package name, a short description, and monthly downloads, GitHub stars, and forks.
langchain-ai / langsmith
LangSmith Client SDK Implementations
81M downloads · 871 stars · 228 forks

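The SDK's tracing decorator is the quickest way to see it in action; a minimal sketch, assuming the langsmith package is installed and a LANGSMITH_API_KEY is set in the environment (the answer function is a hypothetical stand-in for a real LLM call):

    from langsmith import traceable

    @traceable  # each call is recorded as a run in the configured LangSmith project
    def answer(question: str) -> str:
        # hypothetical stand-in for a real model call
        return f"echo: {question}"

    answer("What does this package do?")
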
mlflow / mlflow-skinny
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
38.3M downloads · 26K stars · 6K forks

mlflow / mlflow
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
36.4M downloads · 26K stars · 6K forks

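A minimal sketch of MLflow's tracking API, assuming the default local file store (runs land in ./mlruns):

    import mlflow

    # open a run and record a parameter and a metric in the local ./mlruns store
    with mlflow.start_run(run_name="demo"):
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("accuracy", 0.93)
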
danthedeckie / simpleeval
Simple Safe Sandboxed Extensible Expression Evaluator for Python
16.7M downloads · 595 stars · 92 forks

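A minimal sketch of simpleeval's core call, which evaluates an untrusted expression without handing the string to Python's built-in eval():

    from simpleeval import simple_eval

    simple_eval("2 + 2 * 10")                                        # 22
    simple_eval("a + b", names={"a": 11, "b": 31})                   # 42
    simple_eval("square(4)", functions={"square": lambda x: x * x})  # 16
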
mlflow / mlflow-tracing
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
16.2M downloads · 26K stars · 6K forks

huggingface / evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
7.1M downloads · 2K stars · 318 forks

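Metrics in 🤗 Evaluate are loaded by name and computed from predictions and references; a minimal sketch with the built-in accuracy metric (the first load fetches the metric from the Hugging Face Hub):

    import evaluate

    accuracy = evaluate.load("accuracy")
    result = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
    print(result)  # {'accuracy': 0.75}
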
comet-ml / opik
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
5.1M downloads · 19K stars · 1K forks

vibrantlabsai / ragas
Supercharge Your LLM Application Evaluations 🚀
1.4M downloads · 14K stars · 1K forks

MiXaiLL76 / faster-coco-eval
Continuation of the abandoned fast-coco-eval project
539K downloads · 141 stars · 11 forks

MichaelGrupp / evo
Python package for the evaluation of odometry and SLAM
187K downloads · 4K stars · 790 forks

jfjlaros / spreadscript
SpreadScript: Use a spreadsheet as a function.
94K downloads · 1 star · 0 forks

AmenRa / ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
90K downloads · 674 stars · 31 forks

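ranx builds its Qrels and Run objects from plain nested dictionaries; a minimal sketch (relevance scores and metric choices are illustrative):

    from ranx import Qrels, Run, evaluate

    qrels = Qrels({"q_1": {"doc_a": 1, "doc_b": 0, "doc_c": 1}})
    run = Run({"q_1": {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.1}})

    print(evaluate(qrels, run, ["ndcg@3", "map"]))  # metric name -> score
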
comet-ml / opik-optimizer
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
58K downloads · 19K stars · 1K forks

agenta-ai / agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
56K downloads · 4K stars · 517 forks

cvangysel / pytrec-eval
pytrec_eval is an Information Retrieval evaluation tool for Python, based on the popular trec_eval.
50K downloads · 346 stars · 36 forks

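Like trec_eval, pytrec_eval takes relevance judgments and a run as nested dictionaries keyed by query and document id; a minimal sketch:

    import pytrec_eval

    qrel = {"q1": {"d1": 1, "d2": 0, "d3": 1}}
    run = {"q1": {"d1": 0.9, "d2": 0.5, "d3": 0.1}}

    evaluator = pytrec_eval.RelevanceEvaluator(qrel, {"map", "ndcg"})
    print(evaluator.evaluate(run))  # {'q1': {'map': ..., 'ndcg': ...}}
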
yaroslaff / evalidate
Safe and fast evaluation of untrusted user-supplied python expressions
49K downloads · 40 stars · 4 forks

modelscope / evalscope
A streamlined and customizable framework for efficient large model (LLM, VLM, AIGC) evaluation and performance benchmarking.
47K downloads · 3K stars · 322 forks

thakur-nandan / sprint-toolkit
SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval
41K downloads · 47 stars · 2 forks

ibm / unitxt
🦄 Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking
34K downloads · 212 stars · 67 forks

run-house / kubetorch
Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.
30K downloads · 1K stars · 57 forks

run-house / runhouse
Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.
27K downloads · 1K stars · 57 forks

huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
23K downloads · 2K stars · 454 forks

foreai-co / fore
The fore client package
21K downloads · 13 stars · 1 fork

dustalov / evalica
Evalica, your favourite evaluation toolkit
21K downloads · 62 stars · 5 forks

    • Data from PyPI, GitHub, ClickHouse, and BigQuery