WARC Python Packages

fastwarc

A robust web archive analytics toolkit

1.3M 137 18

resiliparse

A robust web archive analytics toolkit

1.2M 137 18

warcio

Streaming WARC/ARC library for fast web archive IO

1.2M 456 69

ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

7K 650 41

archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

6K 27K 2K

cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15

cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 204 34

forum-dl

Scrape posts and threads from forums, news aggregators, and mail archives; export to JSONL, mailbox, or WARC

1K 116 6

warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

1K 83 7

internet-archive-extractor

Tool for extracting archived websites from the Internet Archive and saving them as WARC files.

497 1 0

scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

390 9 1

scrapy-warcio

Support for writing WARC files with Scrapy

385 24 6

warcdb

WarcDB: Web crawl data as SQLite databases

381 404 10

otmt

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

275 9 3

cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

245 194 25

cdxsummary

Summarize web archive capture index (CDX) files

189 88 29

archivebox-likn

The self-hosted internet archive.

132 27K 2K

metawarc

metawarc: a command-line tool for data extraction from WARC files (web archives)

131 35 2
