HTML Parser Python Packages

justext

Heuristic-based boilerplate removal tool.

6.1M 818 89

sec-parser

Parse SEC EDGAR HTML documents into a tree of elements that correspond to the visual (semantic) structure of the document.

92K 282 78

fast-scrape

🦀 High-performance HTML parsing library. Rust core with native bindings for Python, Node.js & WASM. SIMD-accelerated, memory-safe, consistent API everywhere.

11K 5 0

advancedhtmlparser

Fast, indexed Python HTML parser that builds a DOM node tree and provides the common getElementsBy* functions for scraping, testing, modification, and formatting. Also supports XPath.

8K 101 25

pywebcopy

Saves webpages to your local hard disk as-is, including images, CSS, JS, and links.

6K 640 117

pyxml3

Pure Python 3 alternative to the stdlib xml.etree, with HTML support.

4K 1 1

dedoc

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)

2K 661 52

yirabot

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

1K 17 0
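
Several of the packages above describe the same basic idea: parse HTML into a node tree, then index the nodes so they can be looked up quickly. As a rough illustration of that idea only, here is a minimal sketch built on the standard library's html.parser. It is not the API of any package listed here; the names Node, IndexedTreeBuilder, by_tag, by_id, and parse are all hypothetical, and the handling of malformed markup is deliberately simplistic.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical DOM-like node: a tag, its attributes, and ordered children."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def text(self):
        # Concatenate all descendant text in document order.
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)


class IndexedTreeBuilder(HTMLParser):
    """Builds a node tree and keeps tag and id indexes while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self._stack = [self.root]  # currently open elements
        self.by_tag = {}           # tag name -> list of nodes
        self.by_id = {}            # id attribute -> first node with that id

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._stack[-1])
        self._stack[-1].children.append(node)
        self.by_tag.setdefault(tag, []).append(node)
        if "id" in node.attrs:
            self.by_id.setdefault(node.attrs["id"], node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].children.append(data)


def parse(html):
    builder = IndexedTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder


doc = parse('<div id="main"><p class="a">Hello <b>world</b></p><br><p>Bye</p></div>')
paragraphs = [p.text() for p in doc.by_tag.get("p", [])]
main = doc.by_id["main"]
print(paragraphs)
```

Building the indexes during parsing makes tag and id lookups dictionary accesses instead of tree walks, at the cost of extra memory per node. A real library would also need to handle implied end tags, encodings, and other malformed-markup cases that this sketch ignores.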