Document Parsing Python Packages

docling

Get your documents ready for gen AI

6M 59K 4K

unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

5.2M 15K 1K

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2M 77K 10K

docling-slim

Get your documents ready for gen AI

206K 59K 4K

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

105K 20K 2K

openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

6K 1K 125

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

4K 62 6

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

3K 32 3

flamehaven-filesearch

FLAMEHAVEN FileSearch - Open-source semantic document search with multi-provider LLM support (Gemini, OpenAI, Claude, Ollama)

3K 98 13

unstructured-cpu

3K 15K 1K

acatome-extract

PDF extraction pipeline for acatome — Marker/fitz, metadata, block chunking

3K 0 0

docstrange

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 129

olgadoc

Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

2K 6 0

longparser

Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.

2K 15 1

churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

1K 38 4

tikara

The metadata and text content extractor for almost every file type.

537 9 0

fadoudou2

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

535 77K 10K

grits-metric

GriTS metric for table extraction

448 2 0

pdfstructx

Intelligent PDF parser with font-aware structure detection, table extraction, and multi-column support

227 0 0

doc23

Convert PDFs, DOCX, TXT & more into structured JSON trees using Python. Built for legal, institutional and NLP applications.

215 0 0

je-paddleocr

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment across server, mobile, embedded, and IoT devices)

214 77K 10K

langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

192 77K 10K

docling-google-ocr

Get your documents ready for gen AI

179 59K 4K

pdf-bank-statement-parser

Command-line tool for converting PDF bank statements into CSV

164 6 5
