Data Extraction Python Packages

Python packages with the GitHub topic data-extraction. Sorted by relevance, with stars and monthly downloads.
firecrawl
firecrawl-py

🔥 The API to search, scrape, and interact with the web for AI

7M 114K 7K
vi3k6i5
flashtext

Extract Keywords from sentence or Replace keywords in sentences.

2.3M 6K 598
firecrawl
firecrawl

🔥 The API to search, scrape, and interact with the web for AI

746K 114K 7K
thinh-vu
vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

613K 1K 275
D4Vinci
scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

612K 47K 4K
scrapfly
scrapfly-sdk

Official Python SDK for the Scrapfly platform: web scraping, screenshots, AI extraction, crawling, and a remote anti-bot browser. Integrates with Scrapy, LlamaIndex, and LangChain.

311K 55 15
yfedoseev
pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

92K 717 78
hhursev
recipe-scrapers

Python package for scraping recipes data

86K 2K 643
a-maliarov
amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

72K 490 91
jpjacobpadilla
stealth-requests

Undetected web-scraping & seamless HTML parsing in Python!

57K 467 48
linw1995
jsonpath-extractor

A query expression for extracting data from JSON.

17K 41 4
shcherbak-ai
contextgem

ContextGem: Effortless LLM extraction from documents

9K 2K 155
AIMLPM
markcrawl

Fast Python web crawler for RAG and AI ingestion. Extracts clean Markdown from any site for LLMs and vector stores.

8K 2 0
nppoly
cyac

High performance Trie and Ahocorasick automata (AC automata) Keyword Match & Replace Tool for python. Correct case insensitive implementation!

7K 94 15
thinh-vu
vnstock3

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

7K 1K 275
ironmussa
optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

6K 2K 232
StabRise
scaledp

ScaleDP is an Open-Source extension of Apache Spark for Document Processing

5K 18 1
html-extract
hext

A module and command-line utility to extract structured data from HTML

5K 55 3
aborruso
scrape-cli

Extract HTML elements from the command line using CSS selectors or XPath. Pipe-friendly Python CLI.

5K 26 1
us
crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

3K 71 5
gambolputty
wiktionary-de-parser

Extracts data from German Wiktionary dump files.

3K 26 8
kaya70875
ytfetcher

⚡ Build structured YouTube datasets at scale — effortlessly fetch transcripts and rich metadata for NLP, ML, and AI workflows.

2K 70 11
omniologynow-rgb
scout-intel-mcp

The Google for AI agents — ask Claude to research any company, analyze competitors, track market trends, and score data quality. 6 intelligence tools, 5+ data sources (DuckDuckGo, NewsAPI, Wikipedia, web scraping), confidence-scored structured JSON. pip install scout-intel-mcp

2K 0 0
LeonTing1010
taprun

Automate any website. AI compiles it. Runs forever at $0. 200+ skills, 3 runtimes, MCP native.

2K 3 1
Data from PyPI, GitHub, ClickHouse, and BigQuery