65 dependents
Package Description Downloads/month
YData allows you to use the *Data-Centric* tools from the YData ecosystem to acceler... 29K
Data management with LLMs 10K
Common types and algorithms used for matching two information 9K
A self-improving code reasoning engine with persistent semantic memory 5K
Knowledge Model Runtime, Ontology management, and interface to Graph and Vector ... 4K
A personal information delivery system - collect, subscribe to, and organize inf... 4K
A Python library for near deduplication and record linkage 3K
Modern Data Centric AI system for Large Language Models 3K
3K
QBindiff binary diffing tool based on a Network Alignment problem 2K
Quid is a tool for quotation detection in texts and can deal with common propert... 2K
A Python library for managing SQL databases with support for multiple database t... 2K
Tree-based visualization for high-dimensional data. Organizes similar items into... 1K
Dataset fingerprinting library 1K
🦆 Stream-first data quality monitoring in Python! Learn more: https://arxiv.org... 1K
Spatial tiling, MVT generation, and tile serving for geospatial data 1K
Strwythura: construct an entity-resolved knowledge graph from structured data so... 1K
Easy Data Preparation with latest LLMs-based Operators and Pipelines. 1K
Neural Network Dataset 1K
Language models for Biological Sequence Transformation and Evolutionary Represen... 1K
treepeat: a tool to find similarities in a codebase 829
PolyDeDupe: Multi-Lingual Data Deduplication 819
Portable semantic model for knowledge 813
AI code-writing assistant that understands data content 808
Pamola Core library for data anonymization, privacy models, metrics, and utiliti... 737
Phylogenetic profiling with orthology data 692
Library to interact with and build OMA HDF5 files 684
A Python module for managing SQL databases (SQLite and PostgreSQL) with support ... 650
:bacon: Grab info needed by Carbonara from executables and disassemblers databas... 601
USC ISI implementation of D3M Datamart API 601
Production-grade data pipeline for training LLMs and multimodal AI — text, image... 565
High-throughput MinHash + LSH toolkit for large-scale text corpus deduplication ... 509
NeuroMem - Brain-inspired memory system for AI agents with multi-modal storage 508
processing web text data for NLP LLM 491
AI code quality scanner — catches what copilots miss. 407
LLM-Based Neural Network Generator 391
Prefix-aware curation & near-dedup for NN code via MinHash/LSH and AST fingerpri... 377
Open source AI content verification. Score claims 0-100 to detect misinformation... 367
A set of scripts and configurations for pretraining of Large Language Models (LL... 346
Text Data Analysis module for analyzing text data in tabular data. 341
AI Powered Data Platform 320
SCAR: An AI-powered tool for ranking and filtering instruction-answer pairs base... 315
Corpus analysis and processing toolkit 247
The world's most complete, resilient, multilingual web crawler 224
RAG data quality and ingestion framework - Acquire, Validate, Ingest 211
Reaction fingerprint 211
Remove duplicates and near-duplicates from text corpora, no matter the scale. 207
Intelligent research paper analysis pipeline with LLM-driven categorization 197
Safe, FPR-controlled, reproducible deduplication pipeline for bibliographic refe... 189
Toolkit to enrich, validate and explore YAML metadata from a pandas DataFrame. 187