PyPI Stats

Text Mining Python Packages

Python packages tagged with the GitHub topic text-mining, sorted by relevance and shown with stars and monthly downloads.
adbar
trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

7.5M 6K 363
deanmalmgren
textract

extract text from any document. no muss. no fuss.

386K 5K 675
csurfer
rake-nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

278K 1K 151
bookieio
breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

101K 205 25
Lilykos
pyphonetics

A Python 3 phonetics library.

86K 139 21
Lips7
matcher-py

A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.

84K 18 1
KyleKing
textract-py3

Maintained fork of deanmalmgren/textract to replace '*' dependencies and other updates

54K 14 2
JasonKessler
scattertext

Beautiful visualizations of how language differs among document types.

20K 2K 286
aphp
edsnlp

Modular, fast NLP framework, compatible with PyTorch and spaCy, offering tailored support for French clinical notes.

18K 163 41
biolab
orange3-text

🍊 📄 Text Mining add-on for Orange3

11K 134 86
averbis
averbis-python-api

Conveniently access the REST API of Averbis products using Python

6K 12 5
vmenger
deduce

Deduce: de-identification method for Dutch medical text

5K 64 27
mesejo
trrex

Efficient string matching with regular expressions

3K 146 7
PetrKorab
arabica

Python package for text mining of time-series data

3K 75 16
huspacy
huspacy

HuSpaCy: industrial-strength Hungarian natural language processing

3K 182 18
huspacy
huspacy-nightly

HuSpaCy: industrial-strength Hungarian natural language processing

3K 182 18
rosette-api
rosette-api

Babel Street Analytics Client Library for Python

2K 38 37
stephenhky
shorttext

Various Algorithms for Short Text Mining

2K 471 74
lasigeBioTM
bent

Biomedical Term Annotator

2K 9 1
vgrabovets
multi-rake

Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python

2K 272 37
sergioburdisso
pyss3

A Python library for Interpretable Machine Learning in Text Classification using the SS3 model, with easy-to-use visualization tools for Explainable AI

2K 348 44
jbesomi
texthero

Text preprocessing, representation and visualization from zero to hero.

2K 3K 237
ronaldgosso
semantic-keywords

TF-IDF counts words. semantic-keywords understands meaning. It uses sentence embeddings (all-MiniLM-L6-v2 by default) and Maximal Marginal Relevance (MMR) to return keywords that are both relevant and diverse — not just the most frequent phrases. Works fully offline after a one-time model download. No API key. No rate limits.

1K 0 0
cgshep
pyeditdistance

A pure, minimalist, no-dependency Python library of various edit distances.

1K 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery