Big Data Python Packages

cython

The most widely used Python to C compiler

124.1M 11K 2K

pyspark

Apache Spark - A unified analytics engine for large-scale data processing

52.3M 43K 29K

delta-spark

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

36.1M 9K 2K

catboost

A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks for Python, R, Java, and C++. Supports computation on CPU and GPU.

6.4M 9K 1K

graphframes

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

2.9M 1K 266

synapseml

Simple and Distributed Machine Learning

2.1M 5K 861

pyspark-client

Apache Spark - A unified analytics engine for large-scale data processing

1.5M 43K 29K

koalas

Koalas: pandas API on Apache Spark

1.4M 3K 368

delta-sharing

An open protocol for secure data sharing

1.4M 939 225

graphframes-py

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs

1.3M 1K 266

uproot

ROOT I/O in pure Python and NumPy.

990K 264 94

daft

High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale

833K 5K 457

feast

The Open Source Feature Store for AI/ML

743K 7K 1K

nipype

Workflows and interfaces for neuroimaging packages

584K 825 544

starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

467K 12K 2K

awkward0

Manipulate arrays of complex data structures as easily as NumPy.

326K 214 39

uproot3

ROOT I/O in pure Python and NumPy.

326K 313 66

arcticdb

ArcticDB is a high-performance, serverless DataFrame database built for the Python Data Science ecosystem.

304K 2K 178

h2o

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

244K 7K 2K

lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

151K 364 121

scikit-learn-intelex

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

140K 1K 185

hazelcast-python-client

Hazelcast Python Client

140K 116 78

daal

oneAPI Data Analytics Library (oneDAL)

132K 646 224

daal4py

Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

96K 1K 185