Web Archiving Python Packages

waybackpy

Wayback Machine API interface & a command-line tool

2.6M 575 41

warcio

Streaming WARC/ARC library for fast web archive IO

1.2M 456 69

pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

11K 2K 239

ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

7K 650 41

archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

6K 27K 2K

auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

5K 1K 100

cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15

cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 204 34

wayback-archive

A comprehensive tool for downloading and archiving websites from the Wayback Machine

2K 8 3

farchive

Local content-addressed archive with observation history. Stores bytes by SHA-256, preserves locator state as contiguous spans, compresses with zstd and corpus-trained dictionaries. SQLite-backed.

885 6 0

wayback-diff

Intelligent web page comparison tool with Wayback Machine support and visual regression testing

716 1 0

hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, replay, mirroring, data scraping, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

599 120 10

fatcat-openapi-client

API client library for fatcat.wiki (a bibliographic catalog)

593 121 18

eprints2archives

Send EPrints URLs to the Internet Archive and other archives

398 4 0

scrapy-warcio

Support for writing WARC files with Scrapy

385 24 6

warcdb

WarcDB: Web crawl data as SQLite databases

381 404 10

hoardy-web-sas

204 120 10

archivebox-likn

The self-hosted internet archive.

132 27K 2K

pywayback

Core Python Web Archiving Toolkit for replay and recording of web archives

1 2K 239
