Dependents of datasketch

65 dependents

Package	Description	Downloads/month
ydata-sdk	YData allows to use the Data-Centric tools from the YData ecosystem to acceler...	29K
cocoon-data	Data management with LLMs	10K
matchescu-matching	Common types and algorithms used for matching two information	9K
neo-reasoner	A self-improving code reasoning engine with persistent semantic memory	5K
vital-ai-vitalsigns	Knowledge Model Runtime, Ontology management, and interface to Graph and Vector ...	4K
feedship	A personal information delivery system - collect, subscribe to, and organize inf...	4K
liken	A Python library for near deduplication and record linkage	3K
open-dataflow	Modern Data Centric AI system for Large Language Models	3K
matchescu-comparison-space-generation		3K
qbindiff	QBindiff binary diffing tool based on a Network Alignment problem	2K
quid	Quid is a tool for quotation detection in texts and can deal with common propert...	2K
thoth-dbmanager	A Python library for managing SQL databases with support for multiple database t...	2K
tmap2	Tree-based visualization for high-dimensional data. Organizes similar items into...	1K
datasig	Dataset fingerprinting library	1K
streamdaq	🦆 Stream-first data quality monitoring in Python! Learn more: https://arxiv.org...	1K
starlet	Spatial tiling, MVT generation, and tile serving for geospatial data	1K
strwythura	Strwythura: construct an entity-resolved knowledge graph from structured data so...	1K
open-dataflow-adp	Easy Data Preparation with latest LLMs-based Operators and Pipelines.	1K
nn-dataset	Neural Network Dataset	1K
lbster	Language models for Biological Sequence Transformation and Evolutionary Represen...	1K
treepeat	treepeat: a tool to find similarities in a codebase	829
polydedupe	PolyDeDupe: Multi-Lingual Data Deduplication	819
resonance-lattice	Portable semantic model for knowledge	813
sketch	AI code-writing assistant that understands data content	808
pamola-core	Pamola Core library for data anonymization, privacy models, metrics, and utiliti...	737
hogprof	Phylogenetic profiling with orthology data	692
pyoma	library to interact and build OMA hdf5 files	684
thoth-sqldb	A Python module for managing SQL databases (SQLite and PostgreSQL) with support ...	650
guanciale	🥓 Grab info needed by Carbonara from executables and disassemblers databas...	601
datamart-isi	USC ISI implementation of D3M Datamart API	601
auralith-data-pipeline	Production-grade data pipeline for training LLMs and multimodal AI — text, image...	565
lshcurator	High-throughput MinHash + LSH toolkit for large-scale text corpus deduplication ...	509
isage-neuromem	NeuroMem - Brain-inspired memory system for AI agents with multi-modal storage	508
sodata	processing web text data for NLP LLM	491
airev-scanner	AI code quality scanner — catches what copilots miss.	407
nn-gpt	LLM-Based Neural Network Generator	391
nn-dup	Prefix-aware curation & near-dedup for NN code via MinHash/LSH and AST fingerpri...	377
truthcheck	Open source AI content verification. Score claims 0-100 to detect misinformation...	367
impruver	A set of scripts and configurations for pretraining of Large Language Models (LL...	346
textanalyzer	Text Data Analysis module for analyzing text data in tabular data.	341
ryoma-ai	AI Powered Data Platform	320
scar-tool	SCAR: An AI-powered tool for ranking and filtering instruction-answer pairs base...	315
corpuskit	Corpus analysis and processing toolkit	247
deepharvest	The world's most complete, resilient, multilingual web crawler	224
gweta	RAG data quality and ingestion framework - Acquire, Validate, Ingest	211
synrfp	Reaction fingerprint	211
nlp-dedup	Remove duplicates and near-duplicates from text corpora, no matter the scale.	207
research-assistant-llm	Intelligent research paper analysis pipeline with LLM-driven categorization	197
srdedupe	Safe, FPR-controlled, reproducible deduplication pipeline for bibliographic refe...	189
metacraft	Toolkit to enrich, validate and explore YAML metadata from a pandas DataFrame.	187