PyPI Stats

Data Mining Python Packages

Python packages with the GitHub topic data-mining. Sorted by relevance, with stars and monthly downloads.
microsoft
lightgbm

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

15.3M 18K 4K
catboost
catboost

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

6.4M 9K 1K
RaRe-Technologies
gensim

Topic Modelling for Humans

5.1M 16K 4K
jaidedai
easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

2.8M 29K 4K
yzhao062
pyod

A Python library for anomaly detection across tabular, time series, graph, text, and image data. 60+ detectors, benchmark-backed ADEngine orchestration, and an agentic workflow for AI agents.

2.7M 10K 1K
sktime
sktime

A unified framework for machine learning with time series

1.2M 10K 2K
rasbt
mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

1.1M 5K 903
hackingmaterials
matminer

Data mining for materials science

722K 584 211
alan-turing-institute
clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

401K 1K 80
deanmalmgren
textract

extract text from any document. no muss. no fuss.

383K 5K 675
barrust
pyprobables

Probabilistic data structures in Python. Documentation: http://pyprobables.readthedocs.io/en/latest/index.html

234K 123 12
WenjieDu
tsdb

A Python toolbox that loads 173 public time series datasets for machine/deep learning with a single line of code. The datasets span multiple domains, including healthcare, finance, power, traffic, and weather.

127K 234 23
WenjieDu
pygrinder

PyGrinder: a Python toolkit for grinding data beans into the incomplete for real-world data simulation by introducing missing values with different missingness patterns, including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), subsequence missing, and block missing.

122K 65 6
WenjieDu
pypots

A Python toolkit/library for reality-centric machine/deep learning and data mining on partially observed time series, with 50+ SOTA neural network models for scientific analysis tasks (imputation, classification, clustering, forecasting, anomaly detection, cleaning) on incomplete, irregularly sampled industrial multivariate time series with NaN missing values.

122K 2K 184
biolab
orange3

🍊 📊 💡 Orange: Interactive data analysis

100K 6K 1K
process-intelligence-solutions
pm4py

Official public repository for PM4Py (Process Mining for Python) — an open-source library for exploring, analyzing, and optimizing business processes with Python.

98K 948 347
K0lb3
unitypy

UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets.

87K 1K 180
chuanconggao
prefixspan

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

61K 426 93
KyleKing
textract-py3

Maintained fork of deanmalmgren/textract that replaces '*' dependencies and includes other updates.

54K 14 2
tommyod
efficient-apriori

An efficient Python implementation of the Apriori algorithm.

50K 347 61
catboost
catboost-dev

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

45K 9K 1K
aeon-toolkit
aeon

A toolkit for time series machine learning and deep learning

45K 1K 268
tommyod
paretoset

Compute the Pareto (non-dominated) set, i.e., skyline operator/query.

36K 68 5
exaxorg
accelerator

The Accelerator is a tool for fast and reproducible processing of eBay-scale datasets on a single computer.

36K 4 2
Data from PyPI, GitHub, ClickHouse, and BigQuery.