Tesseract Python Packages

pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

77.9M 10K 718

pymupdfb

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

4.3M 10K 718

ocrmypdf

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

835K 34K 2K

tesserocr

A Python wrapper for the tesseract-ocr API

365K 2K 259

kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

165K 8K 472

tagui

Python package for robotic process automation (RPA)

12K 5K 724

rpa

Python package for robotic process automation (RPA)

11K 5K 724

pysseract

Python binding to Tesseract 4.0 API

4K 1 1

fastmrz

⚡Extracting the Machine Readable Zone (MRZ) from passport or any document images

4K 176 39

aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

2K 27 7

anyocr

A lightweight, unified OCR toolkit with a one-liner API. Supports Surya, EasyOCR, PaddleOCR, Tesseract, and Vision LLMs through a single interface.

2K 0 0

screen-ocr

Easily perform OCR on portions of the screen, choosing from a selection of backends.

697 50 7

nkocr

A module for performing specialized OCR on food products and nutritional tables.

628 39 11

pdf2dataset

Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features

596 19 5

privacy-mask

Automatically redacts sensitive data in screenshots before sending to AI agents

594 9 1

axa-fr-ocr

AXA France OCR library

462 3 0

medical-ocr

Multi-engine OCR pipeline for medical and legal documents

419 1 0

northern-lights-forecast

A northern lights forecast that automatically sends a Telegram notification during substorm events.

389 1 0

tesseract-jax

Execute + differentiate Tesseracts as part of JAX programs, with full support for function transformations like JIT, grad, and more. ⚡

357 31 3

readmrz

Machine readable zone reader on ID cards

346 19 10

pyspark-pdf

PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them

291 81 4

ocyara

A Yara rule engine that scans images for matches using Optical Character Recognition (OCR). See the GitHub page for more information about the Cython, Tesseract, and Leptonica prerequisites.

267 42 8

multitessiocr

Performs a very fast OCR on a list of images (file path, url, base64, bytes, numpy, PIL ...) using Tesseract and returns the recognized text, its coordinates, and line-based word grouping in a DataFrame.

243 0 0

a-pandas-ex-tesseract-multirow-regex-fuzz

Regex/Fuzz search across multiple rows/Tesseract to pandas.DataFrame

239 0 0
