PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
narwhals-dev
narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

82.4M 2K 190
MrPowers
chispa

PySpark test helper methods with beautiful error messages

2.9M 763 78
graphframes
graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.9M 1K 266
capitalone
datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

2.6M 639 160
Microsoft
synapseml

Simple and Distributed Machine Learning

2.2M 5K 861
ibis-project
ibis-framework

the portable Python dataframe library

1.9M 7K 716
jupyter-incubator
hdijupyterutils

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448
jupyter-incubator
autovizwidget

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448
jelmerk
pyspark-hnsw

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

1.3M 303 59
graphframes
graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

1.2M 1K 266
JohnSnowLabs
spark-nlp

State of the Art Natural Language Processing

1.1M 4K 743
Nike-Inc
koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

817K 652 39
MrPowers
quinn

pyspark methods to enhance developer productivity 📣 👯 🎉

635K 687 95
zero323
pyspark-stubs

Apache (Py)Spark type annotations (stub files).

358K 118 36
debugger24
pyspark-test

Testing library for pyspark, inspired by the pandas testing module, to help users write unit tests.

333K 21 5
crflynn
pbspark

protobuf pyspark conversion

294K 23 6
uber
petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

287K 2K 284
asuiu
sparkorm

ORM for Apache Spark and DataFrames schema manager

286K 16 3
databrickslabs
dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

274K 460 93
CamDavidsonPilon
tdigest

t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark

213K 406 53
canimus
cuallee

Possibly the fastest DataFrame-agnostic quality check library in town.

106K 243 22
zalmane
copybook

Python copybook parser

96K 32 10
G-Research
pyspark-extension

A library that provides useful extensions to Apache Spark and PySpark.

87K 236 30
h2oai
h2o-pysparkling-3-1

Sparkling Water provides H2O functionality inside Spark cluster

79K 977 361
    • Data from PyPI, GitHub, ClickHouse, and BigQuery