PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics.
agenta-ai
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

57K 4K 517
Giskard-AI
giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

40K 5K 446
Marker-Inc-Korea
autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

9K 5K 398
hallengray
rag-forge-observability

Production-grade RAG pipelines with evaluation baked in

3K 7 0
hallengray
rag-forge-core

Production-grade RAG pipelines with evaluation baked in

3K 7 0
hallengray
rag-forge-evaluator

Production-grade RAG pipelines with evaluation baked in

3K 7 0
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
xmpuspus
kb-arena

Benchmark 9 retrieval architectures on your documentation — find which KB architecture fits your data

2K 7 2
LLAMATOR-Core
llamator

Framework for testing GenAI systems for vulnerabilities.

1K 207 19
vectara
open-rag-eval

A Python package for RAG Evaluation

824 358 23
aiexponenthq
rag-benchmarking

RAG Benchmarking — Framework-agnostic RAG/agentic-AI evaluation harness. Faithfulness, agentic metrics, EU AI Act Article 15 accuracy evidence. Apache 2.0.

503 0 0
mts-ai
rurage

RURAGE (Robust Universal RAG Evaluation) is a Python library built to speed up evaluation of RAG systems along Correctness, Faithfulness, and Relevance axes, using a variety of deterministic and model-based metrics.

443 34 0
mburaksayici
smallevals

Small Language Models Evaluation Suite for RAG Systems

370 18 2
vero-labs-ai
vero-eval

Open source framework for evaluating AI Agents

261 29 2
RAILethicsHub
rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

181 2 1
syncreus
syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

164 2 0
shaadclt
eval-rag

A comprehensive evaluation toolkit for assessing Retrieval-Augmented Generation (RAG) outputs using linguistic, semantic, and fairness metrics

118 4 0
dariero
ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

86 1 0
Data from PyPI, GitHub, ClickHouse, and BigQuery