Data Extraction Python Packages

firecrawl-py

🔥 The API to search, scrape, and interact with the web for AI

7M 114K 7K

flashtext

Extract keywords from sentences or replace keywords in sentences.

2.3M 6K 598

firecrawl

🔥 The API to search, scrape, and interact with the web for AI

746K 114K 7K

vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

613K 1K 275

scrapling

🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!

612K 47K 4K

scrapfly-sdk

Official Python SDK for the Scrapfly platform: web scraping, screenshots, AI extraction, crawling, and a remote anti-bot browser. Integrates with Scrapy, LlamaIndex, and LangChain.

311K 55 15

pdf-oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

92K 717 78

recipe-scrapers

Python package for scraping recipe data

86K 2K 643

amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

72K 490 91

stealth-requests

Undetected web-scraping & seamless HTML parsing in Python!

57K 467 48

jsonpath-extractor

A query expression for extracting data from JSON.

17K 41 4

contextgem

ContextGem: Effortless LLM extraction from documents

9K 2K 155

markcrawl

Fast Python web crawler for RAG and AI ingestion. Extracts clean Markdown from any site for LLMs and vector stores.

8K 2 0

cyac

High-performance trie and Aho-Corasick automaton (AC automaton) keyword match and replace tool for Python. Correct case-insensitive implementation!

7K 94 15

vnstock3

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

7K 1K 275

optimuspyspark

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

6K 2K 232

scaledp

ScaleDP is an open-source extension of Apache Spark for document processing

5K 18 1

hext

A module and command-line utility to extract structured data from HTML

5K 55 3

scrape-cli

Extract HTML elements from the command line using CSS selectors or XPath. Pipe-friendly Python CLI.

5K 26 1

crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

3K 71 5

wiktionary-de-parser

Extracts data from German Wiktionary dump files.

3K 26 8

ytfetcher

⚡ Build structured YouTube datasets at scale — effortlessly fetch transcripts and rich metadata for NLP, ML, and AI workflows.

2K 70 11

scout-intel-mcp

The Google for AI agents — ask Claude to research any company, analyze competitors, track market trends, and score data quality. 6 intelligence tools, 5+ data sources (DuckDuckGo, NewsAPI, Wikipedia, web scraping), confidence-scored structured JSON. pip install scout-intel-mcp

2K 0 0

taprun

Automate any website. AI compiles it. Runs forever at $0. 200+ skills, 3 runtimes, MCP native.

2K 3 1