Web Archiving Python Packages

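Python packages tagged with the GitHub topic web-archiving, sorted by relevance. Each entry gives the repository owner, the package name, a short description, and a line of counts: monthly downloads first, then GitHub stars, then a third figure that the page does not label.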
akamhy
waybackpy

Wayback Machine API interface & a command-line tool

2.6M 575 41
webrecorder
warcio

Streaming WARC/ARC library for fast web archive IO

1.2M 456 69
webrecorder
pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

12K 2K 239
oduwsdl
ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

6K 650 41
ArchiveBox
archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

5K 27K 2K
bellingcat
auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

5K 1K 100
cocrawler
cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 204 34
webrecorder
cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15
GeiserX
wayback-archive

A comprehensive tool for downloading and archiving websites from the Wayback Machine

1K 8 3
eliask
farchive

Local content-addressed archive with observation history. Stores bytes by SHA-256, preserves locator state as contiguous spans, compresses with zstd and corpus-trained dictionaries. SQLite-backed.

915 6 0
internetarchive
fatcat-openapi-client

API client library for fatcat.wiki (a bibliographic catalog)

577 121 18
Own-Data-Privateer
hoardy-web

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, replay, mirroring, data scraping, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

572 120 10
Florents-Tselai
warcdb

WarcDB: Web crawl data as SQLite databases

411 404 10
internetarchive
scrapy-warcio

Support for writing WARC files with Scrapy

398 24 6
caltechlibrary
eprints2archives

Send EPrints URLs to the Internet Archive and other archives

341 4 0
GeiserX
wayback-diff

Intelligent web page comparison tool with Wayback Machine support and visual regression testing

283 1 0
Own-Data-Privateer
hoardy-web-sas

Passively capture, archive, and hoard your web browsing history, including the contents of the pages you visit, for later offline viewing, replay, mirroring, data scraping, and/or indexing. Your own personal private Wayback Machine that can also archive HTTP POST requests and responses, as well as most other HTTP-level data.

207 120 10
ArchiveBox
archivebox-likn

The self-hosted internet archive.

139 27K 2K
ikreymer
pywayback

Core Python Web Archiving Toolkit for replay and recording of web archives

1 2K 239
Data from PyPI, GitHub, ClickHouse, and BigQuery