PyPI Stats

Search Packages

Find Python packages by name, description, GitHub topic, or filter by metrics
Basaltlabs-app
gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0
Pacific-AI-Corp
langtest

Pacific AI provides a library for delivering safe & effective NLP models.

3K 556 49
raga-ai-hub
agentneo

Python SDK for AI agent observability, monitoring, and evaluation. Includes agent, LLM, and tool tracing; multi-agent system debugging; a self-hosted dashboard; and advanced analytics with timeline and execution-graph views.

2K 16K 4K
qualixar
agentassay

Token-efficient stochastic testing for AI agents. 5-20x cost reduction. 10 framework adapters. Paper: arXiv:2603.02601

2K 4 1
nullpointerdepressivedisorder
infer-check

Correctness and reliability testing for LLM inference engines

2K 2 0
JohnSnowLabs
nlptest

Deliver safe & effective language models

2K 556 49
LLAMATOR-Core
llamator

Framework for testing vulnerabilities of GenAI systems.

1K 207 19
NahuelGiudizi
ai-safety-tester

LLM security testing framework with CVE-style severity scoring and multi-model benchmarking

847 0 0
AquibNawab
agentcloudkelp

YAML-first stress testing for AI agents. Write a contract, inject faults, catch behavioral drift, enforce cost budgets. No Python test code needed — just kelp.yaml and a terminal.

740 1 0
Swanand33
llm-behave

Behavioral testing for LLM applications. pytest plugin with semantic assertions, multi-turn conversation testing, and drift detection. No LLM judge needed.

586 1 0
ssilwal29
api-test-ninja

API testing framework that automates and simplifies API testing using LLM agents, with tests defined in plain English.

570 2 1
Addepto
ccheck

MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrates into CI pipelines for automated testing.

508 95 11
adwantg
toolcallcheck

Deterministic Python testing for tool-using agents. Mock MCP tools, assert exact tool calls and trajectories, verify headers, and run offline in CI.

400 0 0
chanikkyasaai
trajex

AI agent behavioral testing — learns what correct looks like, catches deviations automatically. Zero API keys needed.

366 0 0
evalops
mocktopus

🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing

340 6 0
vincentkoc
tinyqabenchmarkpp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold set plus on-demand multilingual synthetic packs), a generator CLI, and a CI-ready eval harness for ultra-fast LLM smoke testing and regression catching.

260 15 0
Rowusuduah
llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

215 0 0
syncreus
syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

164 2 0
RahulMK22
pyllmtest

🚀 Comprehensive testing framework for LLM applications with semantic assertions, multi-provider support, RAG testing, and prompt optimization. Test AI the right way!

147 1 0
sazed5055
llmtest-framework

pytest for LLM apps - Test for grounding failures, prompt injection, safety violations, and regressions

138 3 0
LGTMLabs
misalign

A Python library for testing LLMs with prompts.

120 0 0
tm243
agent-assembly-line

The simple way to build and embed AI agents into any software stack. Code-native, modular, and LLM-agnostic.

116 0 2
dariero
ragaliq

LLM & RAG evaluation testing framework — hallucination detection, faithfulness metrics, answer relevance scoring, and retrieval pipeline testing with pytest integration

86 1 0
  • Data from PyPI, GitHub, ClickHouse, and BigQuery