PDF Extraction Python Packages

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

165K 8K 472

opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

105K 20K 2K

oxidize-pdf

Python bindings for oxidize-pdf — generate, parse, split, merge & manipulate PDFs with native Rust performance. No C deps, no Java, no subprocesses.

19K 0 0

dd-agents

Find what gets buried in the data room. Open-source integrated M&A due diligence — 9 specialist domains across every contract, cross-referenced with exact citations.

11K 11 6

pytr

Use TradeRepublic in terminal and mass download all documents

8K 726 142

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

4K 62 6

langchain-opendataloader-pdf

A LangChain integration for OpenDataLoader PDF

3K 32 3

iterationlayer

Official Python SDK for the Iteration Layer API — document extraction, image transformation, image generation, document generation, and sheet generation.

2K 0 0

stmtforge

Open-source Python tool to parse credit card PDF statements from Indian banks (HDFC, ICICI, SBI, Axis + 5 more) into structured data. Offline, privacy-first, Streamlit dashboard. pip install stmtforge

698 0 0

phd-deepread-workflow

Transform academic PDFs into structured literature notes and critical-thinking canvases for Obsidian

488 36 2

pdfstructx

Intelligent PDF parser with font-aware structure detection, table extraction, and multi-column support

227 0 0

gdoczai

GDocz by Gramosoft is an open-source Intelligent Document Processing platform that turns raw PDFs and images into clean, structured JSON — powered by multi-engine OCR and AI-driven schema extraction.

207 6 1

mseep-kreuzberg

162 8K 477

ragtable-extract

PDF table extraction for RAG and LLM — convert PDF tables to clean HTML. Fast, local, no GPU. Handles merged cells, line-wrapped text, no serialization.

125 1 0
