Web Archiving Python Packages

waybackpy

Wayback Machine API interface & a command-line tool

2.6M 575 41

warcio

Streaming WARC/ARC library for fast web archive IO

1.2M 456 69

pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

12K 2K 239

ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

6K 650 41

archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

5K 27K 2K

auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

5K 1K 100

cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 204 34

cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15

wayback-archive

A comprehensive tool for downloading and archiving websites from the Wayback Machine

1K 8 3

farchive

Local content-addressed archive with observation history. Stores bytes by SHA-256, preserves locator state as contiguous spans, compresses with zstd and corpus-trained dictionaries. SQLite-backed.

915 6 0
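
The farchive description above names a storage scheme: bytes keyed by SHA-256, kept in SQLite, with a history of where each payload was observed. A minimal sketch of that idea follows, using only the Python standard library. It is not farchive's API: the function names (open_store, put, get) and the table layout are hypothetical, zlib stands in for zstd with trained dictionaries, and the "contiguous spans" bookkeeping is not modelled.

```python
import hashlib
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) a hypothetical content-addressed store."""
    conn = sqlite3.connect(path)
    # One row per unique payload, keyed by its SHA-256 hex digest.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blobs ("
        "digest TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    # One row per sighting of a payload at a locator (for example a URL).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "locator TEXT NOT NULL, digest TEXT NOT NULL, observed_at REAL NOT NULL)"
    )
    return conn


def put(conn, locator, content, observed_at):
    """Store content once per digest and record the observation."""
    digest = hashlib.sha256(content).hexdigest()
    # INSERT OR IGNORE deduplicates: identical bytes are stored only once.
    conn.execute(
        "INSERT OR IGNORE INTO blobs VALUES (?, ?)",
        (digest, zlib.compress(content)),  # zlib is a stand-in for zstd
    )
    conn.execute(
        "INSERT INTO observations VALUES (?, ?, ?)",
        (locator, digest, observed_at),
    )
    conn.commit()
    return digest


def get(conn, digest):
    """Return the original bytes for a digest, or None if unknown."""
    row = conn.execute(
        "SELECT data FROM blobs WHERE digest = ?", (digest,)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0])
```

Storing the same bytes under two locators yields one row in blobs and two rows in observations, which is the property that makes a content-addressed layout compact for repeatedly crawled pages.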

fatcat-openapi-client

API client library for fatcat.wiki (a bibliographic catalog)

577 121 18

hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, replay, mirroring, data scraping, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

572 120 10

warcdb

WarcDB: Web crawl data as SQLite databases

411 404 10

scrapy-warcio

Support for writing WARC files with Scrapy

398 24 6

eprints2archives

Send EPrints URLs to the Internet Archive and other archives

341 4 0

wayback-diff

Intelligent web page comparison tool with Wayback Machine support and visual regression testing

283 1 0

hoardy-web-sas

207 120 10

archivebox-likn

The self-hosted internet archive.

139 27K 2K

pywayback

Core Python Web Archiving Toolkit for replay and recording of web archives

1 2K 239