Benchmark Python Packages

swebench

SWE-bench: Can Language Models Resolve Real-world GitHub Issues?

39.1M 5K 849

pytest-benchmark

pytest fixture for benchmarking code

13.5M 1K 132

mteb

MTEB: Massive Text Embedding Benchmark

2.8M 3K 608

pytest-harvest

Store data created during your `pytest` test execution and retrieve it at the end of the session, e.g. for applicative benchmarking purposes.

457K 76 10

asv

Airspeed Velocity: A simple Python benchmarking tool with web-based reporting

403K 998 204

motmetrics

📊 Benchmark multiple object trackers (MOT) in Python

214K 1K 262

evo

Python package for the evaluation of odometry and SLAM

185K 4K 790

google-benchmark

A microbenchmark support library

117K 10K 2K

pyperformance

Python Performance Benchmark Suite

82K 1K 203

picows

Ultra-fast websocket client and server for asyncio

57K 265 18

membrowse

Track and analyze binary size and memory footprint in embedded firmware

46K 20 1

medmnist

[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification

42K 1K 207

beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

41K 2K 243

mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.

41K 8K 1K

pytest-django-queries

Generate performance reports from your Django database performance tests.

32K 83 2

agentdojo

A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.

23K 548 141

evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

19K 2K 195

holobench

A package for benchmarking the characteristics of arbitrary functions

18K 4 3

optunahub

Python library to use and implement packages in OptunaHub

18K 55 15

lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

17K 4K 578

optimum-benchmark

🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support for Optimum's hardware optimizations & quantization schemes.

13K 336 58

apebench

[NeurIPS 2024] A benchmark suite for autoregressive neural emulation of PDEs. (≥46 PDEs in 1D, 2D, 3D; Differentiable Physics; Unrolled Training; Rollout Metrics)

13K 100 2

logparser3

A machine learning toolkit for log parsing [ICSE'19, DSN'16]

12K 2K 580

gauntlet-cli

Behavioral reliability under pressure. Test how LLMs behave when things get hard.

11K 6 0