PyPI Stats

Spark Python Packages

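Python packages whose GitHub repository carries the topic "spark", sorted by relevance. Each entry gives the repository owner, the package name, the repository description, and a line of three figures: monthly downloads, GitHub stars, and a third count that appears to be GitHub forks. Packages published from the same repository share its description, stars, and third count.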
tobymao
sqlglot

Python SQL Parser and Transpiler

61.5M 9K 1K
apache
pyspark

Apache Spark - A unified analytics engine for large-scale data processing

52.3M 43K 29K
delta-io
delta-spark

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

36.1M 9K 2K
tobymao
sqlglotrs

Python SQL Parser and Transpiler

6.8M 9K 1K
databrickslabs
databricks-labs-dqx

Databricks framework to validate Data Quality of pySpark DataFrames and Tables

5.2M 405 111
fugue-project
fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

3.2M 2K 100
graphframes
graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.9M 1K 266
combust
mleap

MLeap: Deploy ML Pipelines to Production

2.7M 2K 316
capitalone
datacompy

Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!

2.6M 639 160
Microsoft
synapseml

Simple and Distributed Machine Learning

2.1M 5K 861
jupyter-incubator
hdijupyterutils

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448
jupyter-incubator
autovizwidget

Jupyter magics and kernels for working with remote Spark clusters

1.7M 1K 448
dylan-profiler
visions

Type System for Data Analysis in Python

1.6M 217 20
lucacanali
sparkmeasure

This repository contains the development code for sparkMeasure, an Apache Spark performance analysis and troubleshooting library. It simplifies collecting, aggregating, and exporting Spark task/stage metrics, and is designed for practical use by developers and data engineers in interactive analysis, testing, and production monitoring workflows.

1.5M 821 160
apache
pyspark-client

Apache Spark - A unified analytics engine for large-scale data processing

1.5M 43K 29K
databricks
koalas

Koalas: pandas API on Apache Spark

1.4M 3K 368
delta-io
delta-sharing

An open protocol for secure data sharing

1.4M 939 225
graphframes
graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

1.3M 1K 266
jelmerk
pyspark-hnsw

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs

1.3M 303 59
tobymao
sqlglotc

Python SQL Parser and Transpiler

1.2M 9K 1K
fugue-project
fugue-sql-antlr

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

1.2M 2K 100
JohnSnowLabs
spark-nlp

State of the Art Natural Language Processing

1.1M 4K 743
moj-analytical-services
splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

740K 2K 234
flyteorg
flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

549K 315 340

Data from PyPI, GitHub, ClickHouse, and BigQuery