43 dependents
Package Description Downloads/month
news-please - an integrated web crawler and information extractor for news that ... 118K
Cleaning tool for web scraped text 55K
Core Python Web Archiving Toolkit for replay and recording of web archives 11K
Vibe Automation 8K
InterPlanetary Wayback: A distributed and persistent archive replay system using... 7K
Automatically archive links to videos, images, and social media content from Goo... 5K
Data browser based on S3. A visualization tool for S3-based data (json / jsonl / parquet / html / md, etc.). ... 5K
CDXJ Indexing of WARC/ARCs 4K
📜 The Archive Query Log. 3K
Common crawl extractor 2K
Common Crawl import support for Meshagent datasets 1K
MCP server for search and retrieval of web crawler content 1K
Convert HTTP Archive (HAR) -> Web Archive (WARC) format 1K
Create BagIt packages harvesting data from upstream sources 1K
Toolkit for the Målfrid project 1K
Command line tool to convert a file in the WARC format to a file in the ZIM form... 1K
A crawler program to download content from portals (news, forums, blogs) and con... 1K
A tool for preserving email in multiple preservation formats. 825
🗃️ Unified access to web archive CDX and Memento APIs. 622
Tool for extracting archived web sites from the Internet Archive saving as WARC ... 497
warc2summary 441
💾 Scalable and easy WARC records storage on S3. 404
A plugin for Scrapy that allows users to capture and export web archives in the ... 390
WarcDB: Web crawl data as SQLite databases 381
💾 Easy WARC records disk cache. 381
Map the HTML schema of portals to valid TEI XML with the tags and structures use... 350
a tool to find archived web pages from different websites using multiple differe... 318
Library and command line tool for WARC file reporting and processing 291
CoCrawler is a versatile web crawler built using modern tools and concurrency. 245
Convert Directories, Files and ZIP Files to Web Archives (WARC) 204
Scalable Data Preprocessing Tool for Training Large Language Models 185
MCP server for search and retrieval of web crawler content 175
The base for commoncrawl analysis based on sparkcc 153
Save website to WARC using Google Chrome. 150
Scrape Marginados data from a WARC file 121
Warc2graph extracts a graph data structure from WARC files. 90
A petabyte scale data processing framework for AI models using Ray. 79
Common Crawl data processing examples for PySpark. 77
Scalable data pre processing and curation toolkit for LLMs 71
MCP server for search and retrieval of web crawler content 56
Scalable Data Preprocessing Tool for Training Large Language Models 1
Data browser based on S3. A visualization tool for S3-based data (json / jsonl / parquet / html / md, etc.). ... 1
1