Lakehouse Python Packages

starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

454K 12K 2K

pydoris-custom

Apache Doris is an easy-to-use, high-performance, unified analytics database.

215K 15K 4K

pydoris

Apache Doris is an easy-to-use, high-performance, unified analytics database.

93K 15K 4K

pysail

Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads.

32K 2K 129

dbt-fabric-samdebruyn

Maintained and extended fork combining dbt-fabric and dbt-fabricspark

7K 9 2

lakehouse-plumber

A metadata-driven framework for Databricks Lakeflow Declarative Pipelines (formerly Delta Live Tables) that generates production-ready PySpark code.

5K 56 11

dbt-doris

Apache Doris is an easy-to-use, high-performance, unified analytics database.

4K 15K 4K

lakehouse-engine

The Lakehouse Engine is a configuration-driven Spark framework, written in Python, that serves as a scalable, distributed engine for lakehouse algorithms, data flows, and utilities for Data Products.

4K 288 50

laketower

Oversee your lakehouse

3K 12 0

databend

Data Agent Ready Warehouse: one warehouse for analytics, search, AI, and a Python sandbox, rebuilt from scratch. Unified architecture on your S3.

2K 9K 870

etherealogic-aetheriaforge

Databricks-native intelligent data transformation engine — coherence-scored Bronze/Silver/Gold with entity resolution and temporal reconciliation in a single deployable product.

2K 1 0

apache-gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

2K 3K 818

ytsaurus-spyt

YTsaurus is a scalable and fault-tolerant open-source big data platform.

2K 2K 205

redpanda-polaris-catalog-python

Apache Polaris, the interoperable, open source catalog for Apache Iceberg

1K 2K 437

lakebench

A multi-modal Python library for benchmarking Azure lakehouse engines and ELT scenarios, supporting both industry-standard and novel benchmarks.

1K 51 17

timeseries-table-format

Append-only time-series table format with gap/overlap tracking (Python bindings).

1K 12 1

ibm-watsonxdata-mcp-server

Model Context Protocol (MCP) server for IBM watsonx.data - enables AI assistants to query and explore lakehouse data

1K 6 3

pyfluss

Apache Fluss (incubating) Python client

1K 47 39

doris-mcp-server

Enterprise-grade Model Context Protocol (MCP) server implementation for Apache Doris

644 291 79

space-datasets

Unified storage framework for the entire machine learning lifecycle

524 155 8

datacoolie

Metadata-driven ETL framework for portable data pipelines across Polars, Spark, Fabric, Databricks, and AWS.

364 0 0

python-for-fluss

Python bindings for Fluss

321 47 39

apache-polaris

Apache Polaris, the interoperable, open source catalog for Apache Iceberg

297 2K 437

ftm-lakehouse

Data standard and archive storage for structured FollowTheMoney data, leaked data, private and public document collections.

284 5 1
