PyPI Stats

Data Processing Python Packages

Python packages tagged with the GitHub topic data-processing, sorted by relevance and shown with stars and monthly downloads.
pandera-dev
pandera

A light-weight, flexible, and expressive statistical data testing library

8.8M 4K 395
lithops-cloud
lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

151K 364 121
svenkreiss
pysparkling

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

135K 271 45
NVIDIA
nvidia-nvimgcodec-cu12

A nvImageCodec library of GPU- and CPU- accelerated codecs featuring a unified interface

126K 146 14
NVIDIA
nvidia-dali-cuda120

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

82K 6K 663
datachain-ai
datachain

Data Memory: the operational data context layer for AI agents - typed, versioned datasets over images, video, docs and tables

49K 3K 140
allenai
dolma

Data and tools for generating and inspecting OLMo pre-training data.

44K 1K 189
run-house
kubetorch

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

30K 1K 57
bytewax
bytewax

Python Stream Processing

29K 2K 109
run-house
runhouse

Distribute and run AI workloads on Kubernetes magically in Python, like PyTorch for ML infra.

27K 1K 57
python-bonobo
bonobo

Extract Transform Load for Python 3.5+

25K 2K 145
matthewdeanmartin
untruncate-json

Python library to repair truncated JSON. Translated directly from the original TypeScript version.

16K 5 0
crate
cratedb-toolkit

CrateDB Toolkit, an SDK for CrateDB and CrateDB Cloud.

16K 11 4
pathwaycom
pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

15K 63K 2K
wq
itertable

⇔ IterTable is a Pythonic API for iterating through tabular data formats, including CSV, XLSX, XML, and JSON.

12K 53 11
NVIDIA
nvidia-nvimgcodec-cu11

A nvImageCodec library of GPU- and CPU- accelerated codecs featuring a unified interface

11K 146 14
NVIDIA
nvidia-nvimgcodec-cu13

A nvImageCodec library of GPU- and CPU- accelerated codecs featuring a unified interface

10K 146 14
tandav
pipe21

Simple functional pipes for Python

8K 19 0
polyaxon
haupt

Lineage metadata API, artifacts streams, sandbox, API, and spaces for Polyaxon

8K 451 207
NVIDIA
nvidia-dali-nightly-cuda120

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.

5K 6K 663
kmatarese
glide

Easy ETL

5K 17 2
CEA-MetroCarac
spectroview

SPECTROview: A Tool for Spectroscopic Data Processing and Visualization.

4K 4 0
abdubakr77
deepcsv

Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files and MORE!

4K 4 2
CouncilDataProject
cdp-backend

Data storage utilities and processing pipelines used by CDP instances.

4K 23 27
Data from PyPI, GitHub, ClickHouse, and BigQuery