PDF to Markdown Python Packages

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

92K 717 78

research-pipeline

Deterministic stage-based pipeline for searching, screening, downloading, converting, and summarizing academic papers. CLI + MCP server.

34K 1 0

pdfmux

PDF extraction that checks its own work. #2 reading order accuracy — zero AI, zero GPU, zero cost.

3K 62 6

olgadoc

Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

2K 6 0

mathcraft-ocr

A Windows math workspace for screenshot OCR, handwriting-to-LaTeX, editing, preview, and symbolic computation, powered by MathCraft OCR and MathLive.

2K 155 14

docstrange

Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

2K 1K 129

vision-parse

Parse PDF documents into markdown formatted content using Vision LLMs

2K 469 66

file2txt

file2txt is a Python library that takes common file formats and turns them into plain text (a txt file) with Markdown styling.

1K 12 2

llm-data-converter

Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract

1K 7 1

churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

948 38 4

markdrop

A comprehensive PDF processing toolkit that converts PDFs to markdown with advanced AI-powered features for image and table analysis. Supports local files and URLs, preserves document structure, extracts high-quality images, detects tables using advanced ML models, and generates detailed content descriptions using multiple LLM providers including OpenAI GPT-4o, Google Gemini, Anthropic Claude, Groq, OpenRouter, and LiteLLM.

941 202 18

wisup-e2m

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

868 1K 72

llm-food

Serving files for hungry LLMs

690 25 0

anything2md

Python package and CLI for converting documents to Markdown using Cloudflare Workers AI toMarkdown.

352 1 0

document-data-extractor

Convert any document format into LLM-ready data format (markdown) with advanced intelligent document processing capabilities powered by pre-trained models.

269 7 1

paperflow-postprocess

Open-source PDF-to-Markdown post-processor with footnotes, LaTeX normalization, figure links, and YAML metadata. Supports Marker, MinerU, PyMuPDF, and Docling. Includes a self-hosted web UI.

231 19 2

smart-llm-loader

A powerful PDF processing toolkit that seamlessly integrates with LLMs for intelligent document chunking and RAG applications. Features smart context-aware segmentation, multi-LLM support, and optimized content extraction for enhanced RAG performance.

132 76 3

markdownbridge

Python SDK for the MarkdownBridge OCR API — convert documents and images to Markdown

105 0 0

credeed-pdf-to-markdown

Convert PDFs to Markdown using AI so that agents can understand documents.

80 0 0

multimodal-parser

Parse PDFs into markdown using Vision LLMs

1 465 66