Dependents of warcio

43 dependents

Package	Description	Downloads/month
news-please	news-please - an integrated web crawler and information extractor for news that ...	118K
pyplexity	Cleaning tool for web scraped text	55K
pywb	Core Python Web Archiving Toolkit for replay and recording of web archives	11K
vibe-automation	Vibe Automation	8K
ipwb	InterPlanetary Wayback: A distributed and persistent archive replay system using...	7K
auto-archiver	Automatically archive links to videos, images, and social media content from Goo...	5K
vis3	Data browser based on s3. A visualization tool for S3-based data (json / jsonl / parquet / html / md, etc.). ...	5K
cdxj-indexer	CDXJ Indexing of WARC/ARCs	4K
archive-query-log	📜 The Archive Query Log.	3K
cmoncrawl	Common crawl extractor	2K
meshagent-commoncrawl	Common Crawl import support for Meshagent datasets	1K
mcp-server-webcrawl	MCP server for search and retrieval of web crawler content	1K
har2warc	Convert HTTP Archive (HAR) -> Web Archive (WARC) format	1K
bagit-create	Create BagIt packages harvesting data from upstream sources	1K
maalfrid-toolkit	Toolkit for the Målfrid project	1K
warc2zim	Command line tool to convert a file in the WARC format to a file in the ZIM form...	1K
webarticlecurator	A crawler program to download content from portals (news, forums, blogs) and con...	1K
mailbagit	A tool for preserving email in multiple preservation formats.	825
web-archive-api	🗃️ Unified access to web archive CDX and Memento APIs.	622
internet-archive-extractor	Tool for extracting archived web sites from the Internet Archive saving as WARC ...	497
warc2summary	warc2summary	441
warc-s3	💾 Scalable and easy WARC records storage on S3.	404
scrapy-webarchive	A plugin for Scrapy that allows users to capture and export web archives in the ...	390
warcdb	WarcDB: Web crawl data as SQLite databases	381
warc-cache	💾 Easy WARC records disk cache.	381
html2tei	Map the HTML schema of portals to valid TEI XML with the tags and structures use...	350
web-archive-get	a tool to find archived web pages from different websites using multiple differe...	318
lockss-warcread	Library and command line tool for WARC file reporting and processing	291
cocrawler	CoCrawler is a versatile web crawler built using modern tools and concurrency.	245
warcit	Convert Directories, Files and ZIP Files to Web Archives (WARC)	204
invisible-rabbit	Scalable Data Preprocessing Tool for Training Large Language Models	185
iflow-mcp-pragmar-mcp-server-webcrawl	MCP server for search and retrieval of web crawler content	175
sp-ccrawl	The base for commoncrawl analysis based on sparkcc	153
crocoite	Save website to WARC using Google Chrome.	150
marginados-warc-scraper	Scrape Marginados data from a WARC file	121
warc2graph	Warc2graph extracts a graph data structure from WARC files.	90
seidr-data	A petabyte scale data processing framework for AI models using Ray.	79
cc-pyspark	Common Crawl data processing examples for PySpark.	77
invisible-unicorn	Scalable data pre processing and curation toolkit for LLMs	71
mseep-mcp-server-webcrawl	MCP server for search and retrieval of web crawler content	56
lava-ray	Scalable Data Preprocessing Tool for Training Large Language Models	1
ovisu	Data browser based on s3. A visualization tool for S3-based data (json / jsonl / parquet / html / md, etc.). ...	1
delphai-company-pages		1