PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics.
agenta-ai
agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

57K 4K 517
Giskard-AI
giskard

🐢 Open-Source Evaluation & Testing library for LLM Agents

40K 5K 446
Marker-Inc-Korea
autorag

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

9K 5K 398
hallengray
rag-forge-observability

Production-grade RAG pipelines with evaluation baked in

3K 7 0
hallengray
rag-forge-core

Production-grade RAG pipelines with evaluation baked in

3K 7 0
hallengray
rag-forge-evaluator

Production-grade RAG pipelines with evaluation baked in

3K 7 0
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
xmpuspus
kb-arena

Benchmark 9 retrieval architectures on your documentation — find which KB architecture fits your data

2K 7 2
LLAMATOR-Core
llamator

Framework for testing GenAI systems for vulnerabilities.

1K 207 19
vectara
open-rag-eval

A Python package for RAG Evaluation

824 358 23
aiexponenthq
rag-benchmarking

RAG Benchmarking — Framework-agnostic RAG/agentic-AI evaluation harness. Faithfulness, agentic metrics, EU AI Act Article 15 accuracy evidence. Apache 2.0.

503 0 0
mts-ai
rurage

RURAGE (Robust Universal RAG Evaluation) is a Python library built to speed up evaluation of RAG systems along Correctness, Faithfulness, and Relevance axes, using a variety of deterministic and model-based metrics.

443 34 0
mburaksayici
smallevals

Small Language Models Evaluation Suite for RAG Systems

370 18 2
vero-labs-ai
vero-eval

Open source framework for evaluating AI Agents

261 29 2
RAILethicsHub
rail-score

DEPRECATED — use rail-score-sdk instead. This package redirects to rail-score-sdk.

181 2 1
syncreus
syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

164 2 0
shaadclt
eval-rag

A comprehensive evaluation toolkit for assessing Retrieval-Augmented Generation (RAG) outputs using linguistic, semantic, and fairness metrics

118 4 0
dariero
ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

86 1 0
Data from PyPI, GitHub, ClickHouse, and BigQuery