PyPI Stats

Deduplication Python Packages

Python packages tagged with the GitHub topic "deduplication", sorted by relevance and shown with stars and monthly downloads.
J535D165
recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python

4.6M 1K 153
moj-analytical-services
splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

740K 2K 234
borgbackup
borgbackup

Deduplicating archiver with compression and authenticated encryption.

76K 13K 843
iscc
fastcdc

FastCDC implementation in Python https://pypi.org/project/fastcdc/

72K 64 18
MinishLab
semhash

Fast Multimodal Semantic Deduplication & Filtering

52K 919 56
opensanctions
nomenklatura

Framework and command-line tools for integrating FollowTheMoney data streams from multiple sources

38K 239 43
LibreTranslate
removedup

Remove duplicates from parallel corpora

6K 7 1
Fallen-Breath
pyfastcdc

A high-performance FastCDC 2020 implementation written in Python + Cython

6K 2 0
benzsevern
goldenmatch

🟡 Golden Suite — polyglot data-quality + entity-resolution toolkit. GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch dedupes → GoldenPipe orchestrates. Zero-config defaults, 97% F1, MCP server per package + one master, multi-arch container images, drop-in Airflow DAGs.

5K 36 5
zinggAI
zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

5K 1K 165
jolovicdev
cashet

A Python memoization cache with Redis, async support, and an HTTP server. Cache Python function results like git objects. Content-addressable, pipeline-friendly, and CLI-inspectable. Run once, reuse forever.

5K 0 0
lumen-argus
crossfire-rules

Regex rule overlap analyzer for DLP, secret scanning, SAST, and IDS tools

4K 0 0
DBRetina
dbretina

DBRetina Python Package

4K 1 3
cansarigol
mailmap-checker

Pre-commit hook that checks and maintains .mailmap completeness

3K 2 0
Elijas
redis-message-queue

Reliable Python message queuing with Redis and built-in deduplication. Publish once, process once, recover from crashes - across any number of producers and consumers.

3K 5 1
vaultah
replicat

Configurable and lightweight backup utility with deduplication and encryption.

3K 5 0
AI-team-UoA
pyjedai

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.

2K 93 13
sebastienrousseau
bankstatementparser

Parse bank statements across CAMT, PAIN.001, CSV, OFX/QFX, MT940, and PDFs (digital + scanned) into unified Transaction models. Deterministic ISO 20022 parsers, LLM fallback for PDFs, vision for scans, balance verification, categorization, and interactive review mode.

2K 19 5
fritshermans
deduplipy

Python package for deduplication/entity resolution using active learning

2K 82 8
AshleyT3
atbu-pkg

ATBU Cloud/Local Backup & File Integrity/Duplication Management Utility

2K 1 0
zevatov
nra

🧬 The 21st Century Data Format for AI. CDC deduplication, Zero-Download cloud streaming, O(1) random access. Built in Rust.

2K - -
kdeldycke
mail-deduplicate

📧 CLI to deduplicate mails from mail boxes

1K 196 42
yaroslaff
hashget

Deduplication tool for archiving data with an extremely high ratio

1K 7 2
chr1st1ank
narrow-down

Fast fuzzy text search

1K 12 1
Data from PyPI, GitHub, ClickHouse, and BigQuery