Document Intelligence Python Packages

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

167K 8K 472

paddlenlp

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

36K 13K 3K

tool-helpers

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

10K 13K 3K

contextgem

ContextGem: Effortless LLM extraction from documents

9K 2K 155

fast-dataindex

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

8K 13K 3K

vectorless

Knowing by reasoning, not vectors.

6K 29 2

preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4

longparser

Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.

2K 15 1

infinity-parser2

INF Tech's open-source MLLMs for SOTA visual-language understanding and advanced document intelligence.

1K 141 13

fast-tokenizer-python

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

1K 13K 3K

faster-tokenizer

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

880 13K 3K

document-intelligence-mcp

Local document intelligence MCP server — extract text, tables, metadata from PDF and DOCX. No API key needed.

612 0 0

knwler

Knwler is a lightweight, single-file Python tool that extracts structured knowledge graphs from documents using AI. Feed it a PDF or text file and receive a richly connected network of entities, relationships, and topics — complete with an interactive HTML report and exports ready for your favorite graph analytics platform.

550 123 10

langchain-kreuzberg

Kreuzberg document loader for LangChain — extract text from 88+ file formats with true async and rich metadata

502 4 0

paddle-pipelines

Paddle-Pipelines: An End-to-End Natural Language Processing Development Kit Based on PaddleNLP

425 13K 3K

ks-xlsx-parser

XLSX parser for LLMs, RAG, LangChain, LangGraph, CrewAI, Claude, MCP — turns Excel (.xlsx) into citation-ready JSON with formulas, charts, dependency graphs, and token-counted chunks. Open-source Python library (MIT).

410 17 2

faster-tokenizers

PaddleNLP Faster Tokenizer Library written in C++

257 13K 3K

decompose-mcp

The missing cognitive primitive for AI agents. Structured intelligence from any text.

210 9 2

mseep-kreuzberg

172 8K 477