Text Extraction Python Packages

trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

7.2M 6K 363

justext

Heuristic-based boilerplate removal tool

6.1M 818 89
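
To illustrate what heuristic boilerplate removal can look like, here is a minimal standard-library sketch. It is not justext's API or its actual algorithm; it assumes two simple heuristics (a minimum block length and a maximum link density), and the names `BlockCollector`, `extract_main_text`, `min_length`, and `max_link_density` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not a standard list).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts characters inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of (text, link_chars) tuples
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        # Note: script and style contents are not filtered out in this sketch.
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_length=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        density = link_chars / len(text)
        if len(text) >= min_length and density <= max_link_density:
            kept.append(text)
    return kept
```

Navigation bars tend to be short and link-heavy, so both thresholds reject them, while ordinary paragraphs pass. Real tools combine more signals than this.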

srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

510K 530 53

html-to-markdown

High-performance, CommonMark-compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

487K 694 55

tika

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

410K 2K 251

boilerpy3

Python port of the Boilerpipe library

170K 96 17

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

165K 8K 472

sumy

Module for automatic summarization of text documents and HTML pages.

151K 4K 546

breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

100K 205 25

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

81K 717 78

python-hwpx

Pure Python HWPX automation: read, edit, generate, and validate documents without Hancom Office.

35K 68 29

liteparse

A fast, helpful, and open-source document parser

30K 5K 326

mobi

Python-based software to unpack KindleGen-generated ebooks

30K 77 10

bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

14K 1 0

galeodes

Browsers options

8K 0 0

fundus

A very simple news crawler with a funny name

5K 452 108

preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4

reap-pdf

Rust-first PDF text extraction with geometry-aware search and optional Python bindings

2K 3 1

aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

2K 27 7

vision-parse

Parse PDF documents into markdown formatted content using Vision LLMs

2K 469 66

pattex

Regex-based pattern extraction library for Python: emails, URLs, phones, IPs, and more.

2K 0 0
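
For a sense of how regex-based pattern extraction works in general, here is a small sketch using only the standard `re` module. It does not use or reproduce pattex's API; `PATTERNS` and `extract_patterns` are hypothetical names, and the patterns are deliberately simple rather than exhaustive.

```python
import re

# Simplified patterns for illustration; real-world inputs need stricter rules.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "ipv4": re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b"),
}


def extract_patterns(text: str) -> dict[str, list[str]]:
    """Return every match for each named pattern, in order of appearance."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
```

Keeping the patterns in a dictionary makes it easy to add a new kind of entity without touching the extraction function.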

yirabot

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

1K 17 0

maalfrid-toolkit

Toolkit for the Målfrid project

1K 2 1

pnlp

NLP pre-/post-processing tools.

1K 30 6
