43 dependents
| Description | Downloads/month |
|---|---|
| news-please - an integrated web crawler and information extractor for news that ... | 118K |
| Cleaning tool for web scraped text | 55K |
| Core Python Web Archiving Toolkit for replay and recording of web archives | 11K |
| Vibe Automation | 8K |
| InterPlanetary Wayback: A distributed and persistent archive replay system using... | 7K |
| Automatically archive links to videos, images, and social media content from Goo... | 5K |
| Data browser based on s3. A visualization tool for data on S3 (json / jsonl / parquet / html / md, etc.)... | 5K |
| CDXJ Indexing of WARC/ARCs | 4K |
| 📜 The Archive Query Log. | 3K |
| Common Crawl extractor | 2K |
| Common Crawl import support for Meshagent datasets | 1K |
| MCP server for search and retrieval of web crawler content | 1K |
| Convert HTTP Archive (HAR) -> Web Archive (WARC) format | 1K |
| Create BagIt packages harvesting data from upstream sources | 1K |
| Toolkit for the Målfrid project | 1K |
| Command line tool to convert a file in the WARC format to a file in the ZIM form... | 1K |
| A crawler program to download content from portals (news, forums, blogs) and con... | 1K |
| A tool for preserving email in multiple preservation formats. | 825 |
| 🗃️ Unified access to web archive CDX and Memento APIs. | 622 |
| Tool for extracting archived web sites from the Internet Archive, saving as WARC ... | 497 |
| warc2summary | 441 |
| 💾 Scalable and easy WARC records storage on S3. | 404 |
| A plugin for Scrapy that allows users to capture and export web archives in the ... | 390 |
| WarcDB: Web crawl data as SQLite databases | 381 |
| 💾 Easy WARC records disk cache. | 381 |
| Map the HTML schema of portals to valid TEI XML with the tags and structures use... | 350 |
| A tool to find archived web pages from different websites using multiple differe... | 318 |
| Library and command line tool for WARC file reporting and processing | 291 |
| CoCrawler is a versatile web crawler built using modern tools and concurrency. | 245 |
| Convert directories, files, and ZIP files to Web Archives (WARC) | 204 |
| Scalable data preprocessing tool for training large language models | 185 |
| MCP server for search and retrieval of web crawler content | 175 |
| The base for Common Crawl analysis based on sparkcc | 153 |
| Save a website to WARC using Google Chrome. | 150 |
| Scrape Marginados data from a WARC file | 121 |
| Warc2graph extracts a graph data structure from WARC files. | 90 |
| A petabyte-scale data processing framework for AI models using Ray. | 79 |
| Common Crawl data processing examples for PySpark. | 77 |
| Scalable data preprocessing and curation toolkit for LLMs | 71 |
| MCP server for search and retrieval of web crawler content | 56 |
| Scalable data preprocessing tool for training large language models | 1 |
| Data browser based on s3. A visualization tool for data on S3 (json / jsonl / parquet / html / md, etc.)... | 1 |
| | 1 |