PyPI Stats

Search Packages

Find Python packages by name, description, or GitHub topic, or filter by metrics
J535D165
recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python

4.6M 1K 153
moj-analytical-services
splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

718K 2K 234
borgbackup
borgbackup

Deduplicating archiver with compression and authenticated encryption.

72K 13K 843
iscc
fastcdc

FastCDC implementation in Python https://pypi.org/project/fastcdc/

72K 64 18
MinishLab
semhash

Fast Multimodal Semantic Deduplication & Filtering

53K 919 56
opensanctions
nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

35K 239 43
LibreTranslate
removedup

Remove duplicates from parallel corpora

6K 7 1
Fallen-Breath
pyfastcdc

A high-performance FastCDC 2020 implementation written in Python + Cython

6K 2 0
benzsevern
goldenmatch

🟡 Golden Suite — polyglot data-quality + entity-resolution toolkit. GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch dedupes → GoldenPipe orchestrates. Zero-config defaults, 97% F1, MCP server per package + one master, multi-arch container images, drop-in Airflow DAGs.

5K 36 5
zinggAI
zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

4K 1K 165
jolovicdev
cashet

A Python memoization cache with Redis, async support, and an HTTP server. Cache Python function results like git objects. Content-addressable, pipeline-friendly, and CLI-inspectable. Run once, reuse forever.

4K 0 0
DBRetina
dbretina

DBRetina Python Package

4K 1 3
lumen-argus
crossfire-rules

Regex rule overlap analyzer for DLP, secret scanning, SAST, and IDS tools

4K 0 0
cansarigol
mailmap-checker

Pre-commit hook that checks and maintains .mailmap completeness

3K 2 0
Elijas
redis-message-queue

Reliable Python message queuing with Redis and built-in deduplication. Publish once, process once, recover from crashes - across any number of producers and consumers.

3K 5 1
vaultah
replicat

Configurable and lightweight backup utility with deduplication and encryption.

3K 5 0
AI-team-UoA
pyjedai

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.

2K 93 13
fritshermans
deduplipy

Python package for deduplication/entity resolution using active learning

2K 82 8
sebastienrousseau
bankstatementparser

Parse bank statements across CAMT, PAIN.001, CSV, OFX/QFX, MT940, and PDFs (digital + scanned) into unified Transaction models. Deterministic ISO 20022 parsers, LLM fallback for PDFs, vision for scans, balance verification, categorization, and interactive review mode.

2K 19 5
AshleyT3
atbu-pkg

ATBU Cloud/Local Backup & File Integrity/Duplication Management Utility

2K 1 0
zevatov
nra

🧬 The 21st Century Data Format for AI. CDC deduplication, Zero-Download cloud streaming, O(1) random access. Built in Rust.

2K - -
yaroslaff
hashget

deduplication tool for archiving data with extremely high ratio

1K 7 2
kdeldycke
mail-deduplicate

📧 CLI to deduplicate mails from mail boxes

1K 196 42
NickCrews
mismo

The SQL/Ibis powered sklearn of record linkage

1K 23 4
Data from PyPI, GitHub, ClickHouse, and BigQuery