Data Mining Python Packages

lightgbm

A fast, distributed, high-performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

15.3M 18K 4K

catboost

A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, and C++. Supports computation on CPU and GPU.

6.4M 9K 1K

gensim

Topic Modelling for Humans

5.1M 16K 4K

easyocr

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

2.8M 29K 4K

pyod

A Python library for anomaly detection across tabular, time series, graph, text, and image data. 60+ detectors, benchmark-backed ADEngine orchestration, and an agentic workflow for AI agents.

2.7M 10K 1K

sktime

A unified framework for machine learning with time series

1.2M 10K 2K

mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

1.1M 5K 903

matminer

Data mining for materials science

722K 584 211

clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the built-in csv module with improved dialect detection, and comes with a handy command-line application for working with CSV files.

401K 1K 80
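To make "dialect detection" concrete, here is a minimal sketch that uses only the standard library's csv.Sniffer, i.e. the baseline that CleverCSV describes itself as improving on. It does not use CleverCSV's own API, and the sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample: a semicolon-delimited file whose dialect is not known up front.
sample = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\nCarol;41;Rome\n"

# Ask the standard-library sniffer to guess the dialect,
# restricting the candidate delimiters to a few common ones.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

# Parse the text with the detected dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))

print("detected delimiter:", repr(dialect.delimiter))
for row in rows:
    print(row)
```

On clean input like this the built-in sniffer is usually enough; messy files with inconsistent quoting or ragged rows are where a more robust detector is worth trying.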

textract

Extract text from any document. No muss. No fuss.

383K 5K 675

pyprobables

Probabilistic data structures in Python. Documentation: http://pyprobables.readthedocs.io/en/latest/index.html

234K 123 12
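As an illustration of what a probabilistic data structure is, the following is a minimal Bloom filter sketch in pure Python. It is not pyprobables' implementation or API; the class name, the default sizes, and the SHA-256-based hashing scheme are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array, all zeros

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


words = ["lightgbm", "gensim", "pyod"]
bloom = BloomFilter()
for word in words:
    bloom.add(word)

print(all(word in bloom for word in words))
```

The trade-off is memory for certainty: the filter stores only bits, never the items themselves, so a membership test can be wrong in one direction only.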

tsdb

A Python toolbox that loads 173 public time series datasets for machine/deep learning with a single line of code. The datasets come from multiple domains, including healthcare, finance, power, traffic, and weather.

127K 234 23

pygrinder

PyGrinder: a Python toolkit for grinding data beans into incomplete ones, simulating real-world data by introducing missing values with different missingness patterns, including MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random), subsequence missing, and block missing.

122K 65 6
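The simplest of these patterns, MCAR, can be sketched in a few lines of pure Python: every value is dropped independently with the same probability, regardless of what it or any other value is. This is an illustration of the concept only, not PyGrinder's API; the function name mask_mcar, its parameters, and the use of None as the missing marker are assumptions.

```python
import random


def mask_mcar(rows, p, seed=None):
    """Return a copy of `rows` where each value is replaced by None with probability p.

    Hypothetical helper for illustration: the missingness of each cell is
    independent of every observed or unobserved value, which is what MCAR means.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    return [[None if rng.random() < p else value for value in row] for row in rows]


data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
masked = mask_mcar(data, p=0.3, seed=0)
for row in masked:
    print(row)
```

MAR and MNAR differ only in what the drop probability is allowed to depend on: other observed values for MAR, the missing value itself for MNAR.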

pypots

A Python toolkit/library for reality-centric machine/deep learning and data mining on partially-observed time series, with 50+ SOTA neural network models for scientific analysis tasks (imputation, classification, clustering, forecasting, anomaly detection, cleaning) on incomplete, irregularly sampled industrial multivariate time series with NaN missing values.

122K 2K 184

orange3

🍊 📊 💡 Orange: Interactive data analysis

100K 6K 1K

pm4py

Official public repository for PM4Py (Process Mining for Python), an open-source library for exploring, analyzing, and optimizing business processes with Python.

98K 948 347

unitypy

UnityPy is a Python module that makes it possible to extract/unpack and edit Unity assets.

87K 1K 180

prefixspan

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

61K 426 93
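For readers unfamiliar with sequential pattern mining, here is a minimal pure-Python sketch of the PrefixSpan idea: grow a pattern one item at a time and, for each extension, keep only the suffixes ("projections") of the sequences that still contain it. This is not the prefixspan package's code or API; the function name, the (support, pattern) output format, and the toy database are assumptions, and the sketch handles only sequences of single items.

```python
from collections import defaultdict


def frequent_sequences(sequences, min_support):
    """Return (support, pattern) pairs for every pattern found in at least min_support sequences."""
    results = []

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs: the part of
        # each sequence that remains after the current prefix has been matched.
        occurrences = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            for pos in range(start, len(sequences[seq_idx])):
                item = sequences[seq_idx][pos]
                if item not in seen:  # count each sequence once, at its first match
                    seen.add(item)
                    occurrences[item].append((seq_idx, pos + 1))
        for item, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_support:
                pattern = prefix + [item]
                results.append((len(new_projected), pattern))
                mine(pattern, new_projected)  # recurse on the projected database

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical toy database of three sequences.
db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = frequent_sequences(db, min_support=2)
for support, pattern in patterns:
    print(support, pattern)
```

Because each recursive call scans only the projected suffixes, the search space shrinks as patterns grow, which is the main efficiency argument for this family of algorithms.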