PySpark Python Packages

narwhals

Lightweight and extensible compatibility layer between dataframe libraries!

82.4M 2K 190

chispa

PySpark test helper methods with beautiful error messages

2.9M 763 78

graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.9M 1K 266

datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

2.6M 639 160

synapseml

Simple and Distributed Machine Learning

2.2M 5K 861

ibis-framework

The portable Python dataframe library

1.9M 7K 716

hdijupyterutils

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448

autovizwidget

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448

pyspark-hnsw

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

1.3M 303 59

graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

1.2M 1K 266

spark-nlp

State of the Art Natural Language Processing

1.1M 4K 743

koheesio

Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.

817K 652 39

quinn

PySpark methods to enhance developer productivity 📣 👯 🎉

635K 687 95

pyspark-stubs

Apache (Py)Spark type annotations (stub files).

358K 118 36

pyspark-test

Testing library for PySpark, inspired by the pandas testing module, to help users write unit tests.

333K 21 5

pbspark

Conversion between protobuf and PySpark

294K 23 6

petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

287K 2K 284

sparkorm

ORM for Apache Spark and DataFrame schema manager

286K 16 3

dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) can be used to generate large simulated or synthetic data sets for tests, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.

274K 460 93