PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
chatnoir-eu
fastwarc

A robust web archive analytics toolkit

1.3M 137 18
chatnoir-eu
resiliparse

A robust web archive analytics toolkit

1.2M 137 18
webrecorder
warcio

Streaming WARC/ARC library for fast web archive IO

1.2M 456 69
oduwsdl
ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

7K 650 41
ArchiveBox
archivebox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

6K 27K 2K
webrecorder
cdxj-indexer

CDXJ Indexing of WARC/ARCs

4K 34 15
cocrawler
cdx-toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

4K 204 34
mikwielgus
forum-dl

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

1K 116 6
openzim
warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

1K 83 7
WEB-CHILD
internet-archive-extractor

Tool for extracting archived websites from the Internet Archive and saving them as WARC files.

497 1 0
q-m
scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.

390 9 1
internetarchive
scrapy-warcio

Support for writing WARC files with Scrapy

385 24 6
Florents-Tselai
warcdb

WarcDB: Web crawl data as SQLite databases

381 404 10
oduwsdl
otmt

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

275 9 3
cocrawler
cocrawler

CoCrawler is a versatile web crawler built using modern tools and concurrency.

245 194 25
internetarchive
cdxsummary

Summarize web archive capture index (CDX) files

189 88 29
ArchiveBox
archivebox-likn

The self-hosted internet archive.

132 27K 2K
datacoon
metawarc

metawarc: a command-line tool for data extraction from WARC files (web archives)

131 35 2
Data from PyPI, GitHub, ClickHouse, and BigQuery