PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
pandera-dev
pandera

A light-weight, flexible, and expressive statistical data testing library

8.7M 4K 395
skrub-data
skrub

Machine learning with dataframes

206K 2K 214
voxel51
fiftyone

Refine high-quality datasets and visual AI models

179K 11K 752
voxel51
fiftyone-db

Refine high-quality datasets and visual AI models

169K 11K 752
cleanlab
cleanlab

Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

58K 11K 890
dv66
banglanum2words

Converts a Bangla numeric string to literal words.

18K 3 0
KulikDM
pythresh

Outlier Detection Thresholding

10K 137 4
RuedigerVoigt
userprovided

A Python package to check input for validity and plausibility. Convert input into standardized formats.

6K 1 0
ironmussa
optimuspyspark

Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

6K 2K 232
abdubakr77
deepcsv

Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files and MORE!

5K 4 2
cleanlab
cleanlab-studio

Client interface to Cleanlab Studio

4K 31 10
desbordante
desbordante

Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It can also run data-cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

3K 477 99
voxel51
fiftyone-db-ubuntu2204

Refine high-quality datasets and visual AI models

3K 11K 752
Open-DataFlow
open-dataflow

Modern Data Centric AI system for Large Language Models

3K 3K 315
Digital-Dermatology
selfclean

[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.

2K 37 2
yutanagano
tidytcells

Standardise TR/MH/IG data

2K 12 3
voxel51
fiftyone-desktop

FiftyOne Desktop

2K 11K 752
Renumics
sliceguard

A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

1K 63 3
hplt-project
opuscleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

1K 58 16
kemingy
plane

A text processing tool including tag (HTML, URL, Email) extraction and removal, punctuation normalization, simple segmentation, and so on.

1K 11 2
johanneskasser
hdsemg-select

hdsemg-select package

1K 1 0
benzsevern
goldenflow

Data transformation toolkit — standardize, reshape, and normalize messy data. Python & TypeScript. 83 transforms, zero-config mode, MCP server, edge-safe. DQBench 100/100.

1K 1 0
jhd3197
tukuy

A flexible data transformation library with a plugin system

1K 3 0
Open-DataFlow
open-dataflow-adp

Easy Data Preparation with latest LLMs-based Operators and Pipelines.

1K 3K 315
    • Data from PyPI, GitHub, ClickHouse, and BigQuery