Documents Python Packages

jsonpath-ng

Finally, a JSONPath implementation for Python that aims to be standards-compliant. That's all. Enjoy!

77.3M 730 112

docling

Get your documents ready for gen AI

6M 59K 4K

svglib

Read SVG files and convert them to other formats.

4.5M 362 85

docling-slim

Get your documents ready for gen AI

206K 59K 4K

topk-sdk

Provide the right context to your agents.

75K 70 3

signnow-python-sdk

Official SignNow SDK for Python. Sign documents, request eSignatures, and build role-based multi-signer workflows via REST API.

72K 12 7

passporteye

Extraction of machine-readable zone information from passports, visas and ID cards via OCR

13K 446 122

boxdetect

BoxDetect is a Python package based on OpenCV which allows you to easily detect rectangular shapes like character or checkbox boxes on scanned forms.

12K 113 21

organisingfiles-by-type

This repo organises files by type. It uses Python to sort files so that users can get on with what they want to do instead of the tedious new_folder, copy, cut, paste routine. It is also a good way to avoid losing your files in messy file heaps!

3K 1 0

dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

2K 661 52

doc-redaction

Redact PDF/image-based documents, Word, or CSV/XLSX files using a graphical user interface. Demo: https://huggingface.co/spaces/seanpedrickcase/document_redaction or try it with VLMs: https://huggingface.co/spaces/seanpedrickcase/document_redaction_vlm

2K 50 10

olgadoc

Python bindings for Olga. PDF, DOCX, XLSX, HTML → Markdown and typed JSON, 15–40× faster than equivalent-quality OSS. Strictly-typed surface, no Any, one abi3 wheel for CPython 3.8+.

2K 6 0

docler

Abstractions & Tools for OCR / document processing

2K 5 2

oldp

Open Legal Data Platform

1K 137 24

churro-ocr

CHURRO is an OCR toolkit for historical document transcription, built to make handwritten and printed sources readable at high accuracy and lower cost.

1K 38 4

strutex

strutex is a Python library designed to extract JSON from documents.

826 10 0

pullcite

Evidence-backed structured extraction. Pull data from documents with proof of where each value came from.

705 1 0

pr2md

PR2MD is a powerful command-line tool that extracts GitHub Pull Request and Issue data and converts it into comprehensive, well-formatted Markdown documents. Perfect for documentation, archiving, code reviews, or offline analysis of pull requests.

703 1 0

docowling

SDK and CLI for parsing PDF, DOCX, HTML, and more, to a unified document representation for powering downstream workflows such as gen AI applications.

646 3 1