PyPI Stats

Data Preprocessing Python Packages

Python packages with the GitHub topic data-preprocessing. Sorted by relevance, with stars and monthly downloads.
skrub-data
skrub

Machine learning with dataframes

204K 2K 214
mikeqfu
pyhelpers

PyHelpers: An open-source toolkit for facilitating Python users' data manipulation tasks

5K 15 3
desbordante
desbordante

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

3K 477 99
lennymalard
melpy

A NumPy-based deep learning library for building neural networks. It features an automatic differentiation engine and supports training architectures like LSTM, CNN, and FNN.

3K 4 0
Mohan-Zhang-u
mzutils

Mohan Zhang's toolkit

2K 104 9
twardoch
split-markdown4gpt

A Python tool for splitting large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

2K 29 3
d4rk-lucif3r
lucifer-ml

Semi-Auto Machine Learning Library by d4rk-lucif3r

2K 11 5
Eden-Kramer-Lab
loren-frank-data-processing

Python tools for reading in data from Loren Frank's lab

2K 9 1
maet3608
nutsml

Flow-based data pre-processing for deep learning

1K 31 10
johanneskasser
hdsemg-select

hdsemg-select package

1K 1 0
MusfiqDehan
data-preprocessors

🛠️ An easy-to-use tool for data preprocessing, especially for text preprocessing

1K 3 2
Clearbox-AI
clearbox-preprocessor

A fast and flexible data preprocessor based on Polars.

1K 7 0
infinitode
duplipy

DupliPy is a quick and easy-to-use package that can handle text formatting and data augmentation tasks for NLP in Python, with added support for image augmentation.

1K 1 0
YakshHaranwala
ptrail

PTRAIL is a state-of-the-art parallel computation library for mobility data preprocessing and feature extraction.

708 27 7
Elysian01
data-purifier

A Python library for automated exploratory data analysis, automated data cleaning, and automated data preprocessing for machine learning and natural language processing applications.

659 45 7
TsLu1s
atlantic

Atlantic: Automated Data Preprocessing Framework for Machine Learning

585 32 7
Moenupa
deocr

A high-performance, highly customizable reverse OCR tool that renders text or Hugging Face-compatible datasets to images. Dimensions, DPI, and CSS are configurable!

580 2 0
msamogh
nonechucks

Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

441 378 27
rannd1nt
phaeton

A high-performance preprocessing and ETL engine library for Python that sanitizes raw data streams, accelerated by Rust.

403 2 0
michaelscutari
protclust

protclust is a Python library for protein sequence analysis that integrates MMseqs2 for fast clustering and provides tools for creating robust machine learning datasets. It offers cluster-aware data splitting to prevent sequence similarity bias in model evaluation, along with comprehensive protein embedding capabilities for feature generation.

389 4 0
kozodoi
dptools

Data Preprocessing Tools

359 5 3
ksbg
sparklanes

A lightweight data processing framework for Apache Spark

322 16 5
GCousido
vision-converter

This project consists of a library and a CLI for converting datasets between annotation formats.

318 3 0
nxank4
loclean

High-performance, local-first semantic data cleaning library

312 10 1
    • Data from PyPI, GitHub, ClickHouse, and BigQuery