PyPI Stats

Document Intelligence Python Packages

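Python packages tagged with the GitHub topic document-intelligence, sorted by relevance. Each entry lists the repository owner, the package name, a description, and three figures: monthly downloads, GitHub stars, and a third repository-level count (likely forks).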
kreuzberg-dev
kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

167K 8K 472
PaddlePaddle
paddlenlp

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

36K 13K 3K
PaddlePaddle
tool-helpers

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

10K 13K 3K
shcherbak-ai
contextgem

ContextGem: Effortless LLM extraction from documents

9K 2K 155
PaddlePaddle
fast-dataindex

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

8K 13K 3K
vectorlessflow
vectorless

Knowing by reasoning, not vectors. ⭐ Star this repo if you find it useful.

6K 29 2
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
ENDEVSOLS
longparser

Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.

2K 15 1
infly-ai
infinity-parser2

INF Tech's open-source MLLMs for SOTA visual-language understanding and advanced document intelligence.

1K 141 13
PaddlePaddle
fast-tokenizer-python

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

1K 13K 3K
PaddlePaddle
faster-tokenizer

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

880 13K 3K
AiAgentKarl
document-intelligence-mcp

Local document intelligence MCP server — extract text, tables, metadata from PDF and DOCX. No API key needed.

612 0 0
Orbifold
knwler

Knwler is a lightweight, single-file Python tool that extracts structured knowledge graphs from documents using AI. Feed it a PDF or text file and receive a richly connected network of entities, relationships, and topics — complete with an interactive HTML report and exports ready for your favorite graph analytics platform.

550 123 10
kreuzberg-dev
langchain-kreuzberg

Kreuzberg document loader for LangChain — extract text from 88+ file formats with true async and rich metadata

502 4 0
PaddlePaddle
paddle-pipelines

Paddle-Pipelines: An End-to-End Natural Language Processing Development Kit Based on PaddleNLP

425 13K 3K
arnav2
ks-xlsx-parser

XLSX parser for LLMs, RAG, LangChain, LangGraph, CrewAI, Claude, MCP — turns Excel (.xlsx) into citation-ready JSON with formulas, charts, dependency graphs, and token-counted chunks. Open-source Python library (MIT).

410 17 2
PaddlePaddle
faster-tokenizers

PaddleNLP Faster Tokenizer Library written in C++

257 13K 3K
echology-io
decompose-mcp

The missing cognitive primitive for AI agents. Structured intelligence from any text.

210 9 2
Goldziher
mseep-kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

172 8K 477
    • Data from PyPI, GitHub, ClickHouse, and BigQuery