PyPI Stats

Data Curation Python Packages

Python packages tagged with the GitHub topic data-curation, sorted by relevance and listed with stars and monthly downloads.
voxel51
fiftyone

Refine high-quality datasets and visual AI models

178K 11K 752
voxel51
fiftyone-db

Refine high-quality datasets and visual AI models

171K 11K 752
cleanlab
cleanlab

Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

58K 11K 890
visualdatabase
fastdup

fastdup is a powerful, free tool designed to rapidly generate valuable insights from image and video datasets. It helps enhance the quality of both images and labels, while significantly reducing data operation costs, all with unmatched scalability.

10K 2K 87
cleanlab
cleanlab-studio

Client interface to Cleanlab Studio

4K 31 10
voxel51
fiftyone-db-ubuntu2204

Refine high-quality datasets and visual AI models

3K 11K 752
Digital-Dermatology
selfclean

[NeurIPS 2024] 🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors.

2K 37 2
voxel51
fiftyone-desktop

FiftyOne Desktop

1K 11K 752
TieuLongPhan
synrbl

Rebalancing chemical reaction

1K 29 2
Renumics
sliceguard

A library for detecting problematic data segments in structured and unstructured data with few lines of code.

1K 63 3
KenObata
distributed-curator

Partition-aware MinHash LSH deduplication library for large-scale text data curation on Apache Spark.

946 1 0
PennLINC
cubids

Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.

628 30 13
aminnaghdloo
annotate-ez

High-throughput curation and visualization of large-scale single-cell microscopy images, in a lightweight GUI.

621 1 0
voxel51
fiftyone-db-ubuntu2004

Refine high-quality datasets and visual AI models

512 11K 752
cleanlab
cleanlab-cli

Client interface to Cleanlab Studio

501 31 10
UAL-RE
ldcoolp-figshare

Python tool using the Figshare API for data curation

375 3 1
bluestero
urlgenie

Python package to make URL extraction, generalization, validation, and filtration easy.

373 4 1
cleanlab
example-package-elisno

The standard package for data-centric AI, machine learning with label errors, and automatically finding and fixing dataset issues in Python.

277 11K 890
Docta-ai
docta-ai

Docta.ai

215 3K 256
NVIDIA
invisible-rabbit

Scalable data preprocessing and curation toolkit for LLMs

185 2K 264
voxel51
fiftyone-db-debian9

FiftyOne DB

166 11K 752
voxel51
fiftyone-db-ubuntu1604

Project FiftyOne database

131 11K 752
voxel51
fiftyone-db-rhel7

Refine high-quality datasets and visual AI models

121 11K 752
LaureBerti
learn2clean

Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

98 54 20
Data from PyPI, GitHub, ClickHouse, and BigQuery