PyPI Stats

Search Packages

Find Python packages by name, description, GitHub topic, or filter by metrics
adbar
trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

7.2M 6K 363
miso-belica
justext

Heuristic-based boilerplate removal tool

6.1M 818 89
cdown
srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

510K 530 53
kreuzberg-dev
html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

487K 694 55
chrismattmann
tika

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

410K 2K 251
jmriebold
boilerpy3

Python port of Boilerpipe library

170K 96 17
kreuzberg-dev
kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

165K 8K 472
miso-belica
sumy

Module for automatic summarization of text documents and HTML pages.

151K 4K 546
bookieio
breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

100K 205 25
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

81K 717 78
airmang
python-hwpx

Pure Python HWPX automation: read, edit, generate, and validate documents without Hancom Office.

35K 68 29
run-llama
liteparse

A fast, helpful, and open-source document parser

30K 5K 326
iscc
mobi

Python-based software to unpack KindleGen-generated ebooks

30K 77 10
harubi
bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

14K 1 0
qeeqbox
galeodes

Browsers options

8K 0 0
flairNLP
fundus

A very simple news crawler with a funny name

5K 452 108
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
kennipj
reap-pdf

Rust-first PDF text extraction with geometry-aware search and optional Python bindings

2K 3 1
amenezes
aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

2K 27 7
iamarunbrahma
vision-parse

Parse PDF documents into markdown formatted content using Vision LLMs

2K 469 66
meer-khan
pattex

Regex-based pattern extraction library for Python — emails, URLs, phones, IPs, and more.

2K 0 0
OwenOrcan
yirabot

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

1K 17 0
NationalLibraryOfNorway
maalfrid-toolkit

Toolkit for the Målfrid project

1K 2 1
hscspring
pnlp

NLP pre-/post-processing tools.

1K 30 6
    • Data from PyPI, GitHub, ClickHouse, and BigQuery