Data Wrangling Python Packages

skrub

Machine learning with dataframes

206K 2K 214

optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

6K 2K 232

desbordante

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

3K 477 99

omnipy

Omnipy is a high-level Python library for type-driven data wrangling and scalable workflow orchestration (under development)

3K 26 1

whyqd

Data wrangling with simplicity, complete audit transparency, and speed

2K 35 1

hypertools

A Python package for visualizing and manipulating high-dimensional data

2K 2K 162

datagrunt

Datagrunt is a Python library designed to simplify the way you work with CSV files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.

1K 10 2

data-toolz

Simple Python library for handling data I/O tasks

1K 7 0

pyoptimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

1K 2K 232

anonymized-fraud-detection

A small package to parse anonymized credit card transactions and train an ML model on them. Refer to the GitHub wiki for more details. The package was built for PythonVirtualenvOperator() on GCP Airflow.

813 2 1

monggregate

MongoDB aggregation pipelines made easy. Joins, grouping, counting and much more...

733 22 3

csv-trimming

Python package to remove common ugliness from a CSV-like file

651 106 0

pydata-wrangler

Wrangle messy data into DataFrames (pandas or Polars), with a special focus on text data and natural language processing

460 10 2

skloverlay

This repository is the official location of the SKLOverlay Project. It holds everything used for the package on PyPI, including source files.

427 0 0

datasetops

Fluent dataset operations, compatible with your favorite libraries

346 11 0

loclean

High-performance, local-first semantic data cleaning library

314 10 1

prosto

Prosto is a data processing toolkit that radically changes how data is processed by relying heavily on functions and operations with functions, as an alternative to map-reduce and join-groupby

240 93 5

excel-toolkit-cwd

Command-line toolkit for Excel data manipulation and analysis

239 0 0

ics-fixer

Fix slightly broken iCalendar files

190 0 0

pandance

Advanced relational operations for pandas DataFrames

181 5 0

frameon

🐼✨ Frameon - enhances pandas DataFrames with analysis methods while preserving all native functionality

179 4 2

gis-conflation-toolchain

[EARLY-DRAFT] See geojson-diff.py from https://github.com/fititnt/openstreetmap-vs-dados-abertos-brasil

174 0 0

data-cleaning

Data Cleaning is a Python package for data preprocessing. It cleans a CSV file and returns the cleaned DataFrame. It handles imputation, duplicate removal, special-character replacement, and more.

171 9 4

databroom

A cross-language DataFrame cleaning assistant with interactive GUI and one-click code export

169 7 0
