PDF to Markdown Python Packages

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

81K 717 78

research-pipeline

Deterministic stage-based pipeline for searching, screening, downloading, converting, and summarizing academic papers. CLI + MCP server.

33K 1 0

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

4K 62 6

docstrange

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 129

mathcraft-ocr

A Windows math workspace for screenshot OCR, handwriting-to-LaTeX, editing, preview, and symbolic computation, powered by MathCraft OCR and MathLive.

2K 155 14

olgadoc

Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

2K 6 0

vision-parse

Parse PDF documents into Markdown-formatted content using Vision LLMs.

2K 469 66

file2txt

file2txt is a Python library that takes common file formats and turns them into plain text (a .txt file) with Markdown styling.

1K 12 2

churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

1K 38 4

markdrop

A comprehensive PDF processing toolkit that converts PDFs to markdown with advanced AI-powered features for image and table analysis. Supports local files and URLs, preserves document structure, extracts high-quality images, detects tables using advanced ML models, and generates detailed content descriptions using multiple LLM providers including OpenAI GPT-4o, Google Gemini, Anthropic Claude, Groq, OpenRouter, and LiteLLM.

974 202 18

llm-data-converter

Best open-source document-to-Markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, and URLs to clean Markdown, JSON, or HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, and Tesseract.

900 7 1

wisup-e2m

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

839 1K 72

llm-food

Serving files for hungry LLMs

583 25 0

anything2md

Python package and CLI for converting documents to Markdown using Cloudflare Workers AI toMarkdown.

321 1 0

document-data-extractor

Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.

254 7 1

paperflow-postprocess

Open-source PDF-to-Markdown post-processor with footnotes, LaTeX normalization, figure links, and YAML metadata. Supports Marker, MinerU, PyMuPDF, and Docling. Includes a self-hosted web UI.

215 19 2

smart-llm-loader

A powerful PDF processing toolkit that seamlessly integrates with LLMs for intelligent document chunking and RAG applications. Features smart context-aware segmentation, multi-LLM support, and optimized content extraction for enhanced RAG performance.

105 76 3

markdownbridge

Python SDK for the MarkdownBridge OCR API — convert documents and images to Markdown

93 0 0

credeed-pdf-to-markdown

Convert PDFs to Markdown using AI so that agents can understand documents.

73 0 0

multimodal-parser

Parse PDFs into markdown using Vision LLMs

1 465 66
