Document Processing Python Packages

llama-cloud

Python SDK for OCR and document parsing in the cloud with LlamaParse

8.1M 28 7

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

81K 717 78

pyrhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

44K 103 14

liteparse

A fast, helpful, and open-source document parser

30K 5K 326

docling-ocr-onnxtr

OnnxTR OCR plugin for Docling

26K 19 0

template-engine-ia

Document normalization engine — learn a template from examples and convert any document automatically via LLM

10K 1 0

openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

6K 1K 125

asset-aware-mcp

Asset-Aware MCP Server: lets an AI agent precisely access tables, figures, and sections from PDFs, with .docx round-trip editing (DFM), 46 tools and 13 resources, segmentation export, layout overlay, OCR preprocessing, and a knowledge graph (LightRAG).

5K 0 0

pdf-mcp

MCP server that lets Claude Code and other AI agents read large PDFs without hitting context limits. Chunked reading, hybrid search, OCR, table and image extraction, SQLite cache.

4K 25 4

qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

3K 39 24

preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4

pdfmark-ai

Convert PDF files to high-quality Markdown using LLM vision models

3K 0 0

appmod-catalog-blueprints

Library of composable well-architected CDK solution blueprints for real-world use cases — AI agents, document processing, serverless applications and more. Deploy or customize on AWS in minutes.

3K 11 1

docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

3K 139 21

sieves

Plug-and-play document AI with zero-shot models.

2K 124 8

qdrant-loader-mcp-server

2K 39 24

qdrant-loader-core

Shared core for provider-agnostic LLM support and configuration mapping in the qdrant-loader ecosystem

2K 39 24

socr

Multi-engine document OCR with cascading fallback

2K 2 0

dm2xcod

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

2K 4 0

pdf-form-tools

Template-aware tools for filling scanned PDF forms with visual verification

2K 0 0

openextract

Extract structured data from documents, images, audio, and video using LLMs.

1K 16 2

deepseek-ocr-cli

CLI tool for OCR using the DeepSeek-OCR model via Ollama

942 11 3

omnichunk

Structure-aware text chunking library for code, prose, and markup files. Intelligently splits files into context-rich chunks while preserving semantic boundaries. Supports 15+ programming languages, deterministic output, and zero external dependencies. Perfect for RAG systems, code analysis, and LLM context optimization.

898 8 0

office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

878 8 0
