PDF Parser Python Packages

pypdf

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

57.7M 10K 2K

pypdf2

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

25.5M 10K 2K

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2.1M 77K 10K

mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

283K 62K 5K

opendataloader-pdf

PDF parser for AI-ready data. Automates PDF accessibility. Open source.

112K 20K 2K

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

92K 717 78

magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

76K 62K 5K

extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

58K 2K 96

liteparse

A fast, helpful, and open-source document parser

31K 5K 326

oxidize-pdf

Python bindings for oxidize-pdf — generate, parse, split, merge & manipulate PDFs with native Rust performance. No C deps, no Java, no subprocesses.

20K 0 0

casparser

Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech

15K 194 79

mineru-selfhosted-mcp

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

4K 62K 5K

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

4K 32 3

pdfalyzer

Analyze PDFs with colors (and YARA)

3K 366 25

scipdf-parser

Python PDF parser for scientific publications: content and figures

3K 452 65

pdfmark-ai

Convert PDF files to high-quality Markdown using LLM vision models

3K 0 0

dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)

2K 661 52

pyxpdf

Fast and memory-efficient Python PDF Parser based on xpdf sources

2K 44 17

docstrange

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 129