PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
run-llama
llama-cloud

Python SDK for OCR and document parsing in the cloud with LlamaParse

8.1M 28 7
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

81K 717 78
awslabs
pyrhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

44K 103 14
run-llama
liteparse

A fast, helpful, and open-source document parser

30K 5K 326
felixdittrich92
docling-ocr-onnxtr

OnnxTR OCR plugin for Docling

26K 19 0
Luizhcrs
template-engine-ia

Document normalization engine — learn a template from examples and convert any document automatically via LLM

10K 1 0
Topdu
openocr-python

OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

6K 1K 125
u9401066
asset-aware-mcp

Asset-Aware MCP Server — AI Agent precisely accesses tables, figures, sections from PDFs + .docx round-trip editing (DFM) with 46 tools / 13 resources, segmentation export, layout overlay, OCR preprocessing, knowledge graph (LightRAG)

5K 0 0
jztan
pdf-mcp

MCP server that lets Claude Code and other AI agents read large PDFs without hitting context limits. Chunked reading, hybrid search, OCR, table and image extraction, SQLite cache.

4K 25 4
martin-papy
qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

3K 39 24
yuvaraj3855
preocr

Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.

3K 10 4
axzml
pdfmark-ai

Convert PDF files to high-quality Markdown using LLM vision models

3K 0 0
cdklabs
appmod-catalog-blueprints

Library of composable well-architected CDK solution blueprints for real-world use cases — AI agents, document processing, serverless applications and more. Deploy or customize on AWS in minutes.

3K 11 1
docling-project
docling-graph

Transform unstructured documents into validated, rich and queryable knowledge graphs.

3K 139 21
MantisAI
sieves

Plug-and-play document AI with zero-shot models.

2K 124 8
martin-papy
qdrant-loader-mcp-server

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

2K 39 24
martin-papy
qdrant-loader-core

Shared core for provider-agnostic LLM support and configuration mapping for qdrant-loader ecosystem

2K 39 24
r-uben
socr

Multi-engine document OCR with cascading fallback

2K 2 0
KimSeogyu
dm2xcod

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

2K 4 0
ceratops-code
pdf-form-tools

Template-aware tools for filling scanned PDF forms with visual verification

2K 0 0
Mellow-Artificial-Intelligence
openextract

Extract structured data from documents, images, audio, and video using LLMs.

1K 16 2
r-uben
deepseek-ocr-cli

CLI tool for OCR using DeepSeek-OCR model via Ollama

942 11 3
oguzhankir
omnichunk

Structure-aware text chunking library for code, prose, and markup files. Intelligently splits files into context-rich chunks while preserving semantic boundaries. Supports 15+ programming languages, deterministic output, and zero external dependencies. Perfect for RAG systems, code analysis, and LLM context optimization.

898 8 0
yfedoseev
office-oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

878 8 0
Data from PyPI, GitHub, ClickHouse, and BigQuery