PyPI Stats

Big Data Python Packages

Python packages tagged with the GitHub topic big-data, sorted by relevance and listed with their stars and monthly downloads.
cython
cython

The most widely used Python to C compiler

124.1M 11K 2K
apache
pyspark

Apache Spark - A unified analytics engine for large-scale data processing

52.3M 43K 29K
delta-io
delta-spark

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

36.1M 9K 2K
catboost
catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

6.4M 9K 1K
graphframes
graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.9M 1K 266
Microsoft
synapseml

Simple and Distributed Machine Learning

2.1M 5K 861
apache
pyspark-client

Apache Spark - A unified analytics engine for large-scale data processing

1.5M 43K 29K
databricks
koalas

Koalas: pandas API on Apache Spark

1.4M 3K 368
delta-io
delta-sharing

An open protocol for secure data sharing

1.4M 939 225
graphframes
graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

1.3M 1K 266
scikit-hep
uproot

ROOT I/O in pure Python and NumPy.

990K 264 94
Eventual-Inc
daft

High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale

833K 5K 457
feast-dev
feast

The Open Source Feature Store for AI/ML

743K 7K 1K
nipy
nipype

Workflows and interfaces for neuroimaging packages

584K 825 544
starrocks
starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

467K 12K 2K
scikit-hep
awkward0

Manipulate arrays of complex data structures as easily as Numpy.

326K 214 39
scikit-hep
uproot3

ROOT I/O in pure Python and NumPy.

326K 313 66
man-group
arcticdb

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.

304K 2K 178
h2oai
h2o

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

244K 7K 2K
lithops-cloud
lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

151K 364 121
uxlfoundation
scikit-learn-intelex

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

140K 1K 185
hazelcast
hazelcast-python-client

Hazelcast Python Client

140K 116 78
uxlfoundation
daal

oneAPI Data Analytics Library (oneDAL)

132K 646 224
IntelPython
daal4py

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

96K 1K 185
Data from PyPI, GitHub, ClickHouse, and BigQuery