PyPI Stats

Dataset Generation Python Packages

Python packages with the GitHub topic dataset-generation. Sorted by relevance, with stars and monthly downloads.
nfstream
nfstream

NFStream: a Flexible Network Data Analysis Framework.

19K 1K 143
hearmeneigh
datasetrising

Toolchain for creating custom datasets and training Stable Diffusion (1.x, 2.x, XL) models and LoRAs

9K 18 1
lightning-rod-labs
lightningrod-ai

Python SDK for dataset generation on LightningRod platform ⚡

6K 44 3
Kiln-AI
kiln-ai

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

3K 5K 361
HZYAI
ragscore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

2K 31 5
scalexi
scalexi

scalexi is a versatile open-source Python library, optimized for Python 3.11+, that focuses on low-code development and fine-tuning of diverse Large Language Models (LLMs).

2K 13 2
colddsam
modeyolo

ModeYOLO is a versatile Python package designed for efficient color space transformations and simplified dataset modification for deep learning applications. Seamlessly integrating into your workflow, this package empowers users to effortlessly perform diverse color operations and streamline the creation of modified datasets, enhancing the flexibility and convenience of machine learning model training processes.

2K 0 0
Kiln-AI
kiln-server

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

1K 5K 361
DIYer22
bpycv

Computer vision utils for Blender.

1K 501 60
Superuser666-Sigil
human-eval-rust

SigilDERG Data Production is an enterprise-grade Rust pipeline that crawls crates, runs rigorous scans (Clippy, Geiger, license checks), and generates instruction-style JSONL shards. It features semantic chunking, configurable splits, observability, and seamless SigilDERG ecosystem integration.

1K 0 1
MatteoGuadrini
pyreports

pyreports is a Python library that allows you to create complex reports from various sources.

963 113 9
StarlangSoftware
nlptoolkit-datagenerator

Classification dataset generator library for high-level NLP tasks

800 3 0
JaonHax
scpscraper

A Python library designed for scraping data from the SCP wiki.

743 16 4
SimGus
chatette

A powerful dataset generator for Rasa NLU, inspired by Chatito

736 315 53
facebookresearch
stopes

A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.

711 303 46
christiangarcia0311
data-seed-ph

A Python library for generating realistic, synthetic Philippine-based datasets.

690 8 0
OmarSamirz
iftg

IFTG (ImageFromTextGenerator) is a Python package that simplifies creating robust datasets for OCR models. Generate images from text, apply over 10 built-in noise effects, and customize fonts and layouts. IFTG supports all languages and offers endless noise combinations, including custom noise creation.

674 21 2
OOXXXXOO
d-arth

DATASETS FOR WHOLE E-ARTH

603 9 7
4thel00z
ccdown

A Rust-based, resumable downloader CLI and Python library for Common Crawl data

567 0 0
ElementAI
synbols

The Synbols dataset generator is a ServiceNow Research project that was started at Element AI.

544 45 6
radi-cho
datasetgpt

Generate textual and conversational datasets with LLMs.

537 298 19
TimeEval
timeeval-gutentag

A good Timeseries Anomaly Generator.

505 95 17
StarlangSoftware
nlptoolkit-datagenerator-cy

Classification dataset generator library for high-level NLP tasks

496 0 0
C-you-know
ks-llm-ranker

Ranking Large Language Models using the Principle of Least Action! Built during my time at Knit Space, Hubbali under the guidance of Prof. Prakash Hegade.

487 5 0
Data from PyPI, GitHub, ClickHouse, and BigQuery