PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
snorkel-team
snorkel

A system for quickly generating training data with weak supervision

79K 6K 854
a-maliarov
amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

71K 490 91
alteryx
composeml

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.

19K 510 50
ydataai
ydata-synthetic

Synthetic data generators for tabular and time-series data

9K 2K 260
sparkfish
augraphy

Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

6K 527 60
KennethEnevoldsen
augmenty

Augmenty is an augmentation library based on spaCy for augmenting texts.

4K 156 9
timbo4u1
s2s-certify

Physics-audit engine for Physical AI. 8 biomechanical laws certified in real-time — safety gate for robot training pipelines and prosthetics. pip install s2s-certify

3K 3 0
NorskRegnesentral
skweak

skweak: A software toolkit for weak supervision applied to NLP tasks

1K 926 77
stef41
datamix

Dataset mixing & curriculum optimizer — profile, blend, schedule, and budget training data. Zero deps.

789 1 0
stef41
datacruxai

Lightweight CPU-only data quality toolkit for LLM instruction tuning datasets.

781 1 0
stef41
castwright

Generate high-quality synthetic instruction-tuning data from seed examples. Simple API, built-in quality filtering, cost-aware.

772 1 0
MoonyFringers
ladon-crawl

A Python framework for building structured, resumable web crawlers — designed for domains where data quality matters.

758 1 1
phanii9
tidbit

Capture anything into structured Markdown notes and training-ready JSONL.

712 4 0
liuxiaotong
knowlyr-datasynth

Seed-to-scale LLM synthetic data engine with auto-detected templates, schema validation & quality-diversity optimization. CLI + MCP ready.

450 0 0
Data-Centric-AI-Community
fg-data-synthetic

Synthetic data generation methods with different synthetization methods.

422 2K 260
ychampion
codeclaw

CLI for exporting Claude Code/Codex sessions to Hugging Face with privacy redaction, MCP memory, and share-ready dataset workflows.

348 10 2
liuxiaotong
ai-dataset-radar

Multi-source async competitive intelligence engine for AI training data ecosystems with watermark-driven incremental scanning & anomaly detection. CLI + MCP ready.

347 2 1
mockloop
mockloop-mcp

MCP server to generate and run mock APIs from specifications.

304 15 6
mikl0s
lg3k

Log Generator 3000 - A modular log generation tool

283 1 0
leo-gan
pdf-anonymizer-cli

An app and an SDK to anonymize large PDF files

167 2 1
mockloop
iflow-mcp-mockloop-mockloop-mcp

MCP server to generate and run mock APIs from specifications.

152 15 6
leo-gan
pdf-anonymizer-core

An app and an SDK to anonymize large PDF files

147 2 1
abinashmeher999
srtvoiceext

A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.

127 19 5
mattijsmoens
veritas-truth-adapter

LoRA training data and adapter management for teaching AI models truthful, hedged responses. Built on SovereignShield's TruthGuard pipeline.

116 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery