PyPI Stats

Data Preparation Python Packages

Python packages tagged with the GitHub topic data-preparation, sorted by relevance and listed with stars and monthly downloads.
skrub-data
skrub

Machine learning with dataframes

204K 2K 214
ironmussa
optimuspyspark

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

6K 2K 232
amphi-ai
jupyterlab-amphi

Visual data prep powered by Python

3K 1K 106
snowmuffin
convmerge

Merge heterogeneous chat/text sources into a single LLM training format (JSONL)

2K 0 1
sisinflab
datarec-lib

Compatibility wrapper for the renamed DataRec package.

2K 20 1
johanneskasser
hdsemg-select

hdsemg-select package

1K 1 0
hi-primus
pyoptimus

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

1K 2K 232
amphi-ai
amphi-scheduler

Amphi Scheduler (JupyterLab extension + Python backend)

919 1K 106
sisinflab
datarec

A Python Library for Standardized and Reproducible Data Management in Recommender Systems

804 20 1
tracebloc
tracebloc-ingestor

tracebloc data pipeline for training/test dataset setup

496 8 0
CyberCRI
refinedoc

Python library for post-extraction refinement of text that may be derived from PDF extraction.

447 26 3
developmentseed
label-maker

Data Preparation for Satellite Machine Learning

396 469 107
kozodoi
dptools

Data Preprocessing Tools

359 5 3
Florian-Katerndahl
forestiler

Create Image Tiles From Large Input Rasters According to a Classified Mask Vector File

268 0 0
asavinov
prosto

Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby

251 93 5
dataclr
dataclr

A Python library for feature selection in tabular datasets

233 20 2
ixlan
machine-learning-data-pipeline

Pipeline module for parallel real-time data processing for machine learning model development and production.

212 22 2
ved93
ml-express

A Python library for day-to-day data analysis and machine learning.

195 3 1
NVIDIA
invisible-rabbit

Scalable data preprocessing and curation toolkit for LLMs

185 2K 264
maksymsur
spltr

A simple PyTorch-based data loader and splitter

153 1 0
alihanozz
daxpy

A pre-machine-learning model package

86 0 0
NVIDIA
invisible-unicorn

Scalable data preprocessing and curation toolkit for LLMs

77 2K 264
NVIDIA
lava-ray

Scalable data preprocessing and curation toolkit for LLMs

1 2K 264

Data from PyPI, GitHub, ClickHouse, and BigQuery