PDF Extractor RAG Python Packages

paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

2M 77K 10K

mineru

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

282K 62K 5K

magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

77K 62K 5K

mineru-selfhosted-mcp

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

3K 62K 5K

fadoudou2

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

535 77K 10K

je-paddleocr

Awesome OCR toolkits based on PaddlePaddle (8.6M ultra-lightweight pre-trained model; supports training and deployment across server, mobile, embedded, and IoT devices)

214 77K 10K

langchain-paddleocr

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

192 77K 10K

xh-pdf-parser

A practical tool for converting PDF to Markdown

167 62K 5K

ppocrlabel-japan

PPOCRLabelv2 is a semi-automatic graphical annotation tool for OCR, with a built-in PP-OCR model that automatically detects and re-recognizes data. It is written in Python 3 and PyQt5 and supports rectangular box, table, irregular text, and key information annotation modes. Annotations can be used directly to train PP-OCR detection and recognition models.

161 77K 10K

paddleocrwordleveldetection

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

153 77K 10K

ragtable-extract

PDF table extraction for RAG and LLMs: converts PDF tables to clean HTML. Fast, local, no GPU required. Handles merged cells and line-wrapped text, with no serialization.

125 1 0

lazyllm-magic-pdf

Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your Agentic workflows.

60 62K 5K

fadoudou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

3 77K 10K

paddleocr-fagougou

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.

1 77K 10K
