LLM Testing Python Packages

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

10K 6 0

langtest

Pacific AI provides a library for delivering safe & effective NLP models.

3K 556 49

agentneo

Python SDK for an agent AI observability, monitoring, and evaluation framework. Features include agent, LLM, and tool tracing, debugging of multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

2K 16K 4K

agentassay

Token-efficient stochastic testing for AI agents. 5-20x cost reduction. 10 framework adapters. Paper: arXiv:2603.02601

2K 4 1

nlptest

Deliver safe & effective language models

2K 556 49

infer-check

Correctness and reliability testing for LLM inference engines

2K 2 0

llamator

Framework for testing vulnerabilities of GenAI systems.

1K 207 19

ai-safety-tester

LLM security testing framework with CVE-style severity scoring and multi-model benchmarking

949 0 0

agentcloudkelp

YAML-first stress testing for AI agents. Write a contract, inject faults, catch behavioral drift, enforce cost budgets. No Python test code needed — just kelp.yaml and a terminal.

849 1 0

llm-behave

Behavioral testing for LLM applications. pytest plugin with semantic assertions, multi-turn conversation testing, and drift detection. No LLM judge needed.

676 1 0

api-test-ninja

API Testing Framework to automate and simplify API testing using LLM Agents and tests defined in plain English.

633 2 1

ccheck

MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

527 95 11

trajex

AI agent behavioral testing — learns what correct looks like, catches deviations automatically. Zero API keys needed.

389 0 0

mocktopus

🐙 Multi-armed mocks for LLM apps - Drop-in replacement for OpenAI/Anthropic APIs for deterministic testing

342 6 0

toolcallcheck

Deterministic Python testing for tool-using agents. Mock MCP tools, assert exact tool calls and trajectories, verify headers, and run offline in CI.

338 0 0

tinyqabenchmarkpp

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold set plus on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke testing and regression catching.

308 15 0

llm-sentry

Unified AI Reliability Platform. One install, 12 diagnostic engines. Zero-dependency LLM pipeline monitoring.

275 0 0

pyllmtest

🚀 Comprehensive testing framework for LLM applications with semantic assertions, multi-provider support, RAG testing, and prompt optimization. Test AI the right way!

209 1 0

agent-assembly-line

The simple way to build and embed AI agents into any software stack. Code-native, modular, and LLM-agnostic.

194 0 2

syncreus-eval

Evaluate your LLM apps with one function call. Hallucination detection, RAG scoring, and agent evals for OpenAI, Anthropic, and more. 14 evaluators, pytest plugin, composite trust scores.

166 2 0

llmtest-framework

pytest for LLM apps - Test for grounding failures, prompt injection, safety violations, and regressions

163 3 0

misalign

A Python library for testing LLMs with prompts

125 0 0

ragaliq

LLM & RAG evaluation and testing framework ✨

100 1 0