65 dependents
| Description | Downloads/month | |
|---|---|---|
| YData allows to use the *Data-Centric* tools from the YData ecosystem to acceler... | 29K | |
| Data management with LLMs | 10K | |
| Common types and algorithms used for matching two information | 9K | |
| A self-improving code reasoning engine with persistent semantic memory | 5K | |
| Knowledge Model Runtime, Ontology management, and interface to Graph and Vector ... | 4K | |
| A personal information delivery system - collect, subscribe to, and organize inf... | 4K | |
| A Python library for near deduplication and record linkage | 3K | |
| Modern Data Centric AI system for Large Language Models | 3K | |
| | 3K | |
| QBindiff binary diffing tool based on a Network Alignment problem | 2K | |
| Quid is a tool for quotation detection in texts and can deal with common propert... | 2K | |
| A Python library for managing SQL databases with support for multiple database t... | 2K | |
| Tree-based visualization for high-dimensional data. Organizes similar items into... | 1K | |
| Dataset fingerprinting library | 1K | |
| 🦆 Stream-first data quality monitoring in Python! Learn more: https://arxiv.org... | 1K | |
| Spatial tiling, MVT generation, and tile serving for geospatial data | 1K | |
| Strwythura: construct an entity-resolved knowledge graph from structured data so... | 1K | |
| Easy Data Preparation with latest LLMs-based Operators and Pipelines. | 1K | |
| Neural Network Dataset | 1K | |
| Language models for Biological Sequence Transformation and Evolutionary Represen... | 1K | |
| treepeat: a tool to find similarities in a codebase | 829 | |
| PolyDeDupe: Multi-Lingual Data Deduplication | 819 | |
| Portable semantic model for knowledge | 813 | |
| AI code-writing assistant that understands data content | 808 | |
| Pamola Core library for data anonymization, privacy models, metrics, and utiliti... | 737 | |
| Phylogenetic profiling with orthology data | 692 | |
| library to interact and build OMA hdf5 files | 684 | |
| A Python module for managing SQL databases (SQLite and PostgreSQL) with support ... | 650 | |
| :bacon: Grab info needed by Carbonara from executables and disassemblers databas... | 601 | |
| USC ISI implementation of D3M Datamart API | 601 | |
| Production-grade data pipeline for training LLMs and multimodal AI — text, image... | 565 | |
| High-throughput MinHash + LSH toolkit for large-scale text corpus deduplication ... | 509 | |
| NeuroMem - Brain-inspired memory system for AI agents with multi-modal storage | 508 | |
| processing web text data for NLP LLM | 491 | |
| AI code quality scanner — catches what copilots miss. | 407 | |
| LLM-Based Neural Network Generator | 391 | |
| Prefix-aware curation & near-dedup for NN code via MinHash/LSH and AST fingerpri... | 377 | |
| Open source AI content verification. Score claims 0-100 to detect misinformation... | 367 | |
| A set of scripts and configurations for pretraining of Large Language Models (LL... | 346 | |
| Text Data Analysis module for analyzing text data in tabular data. | 341 | |
| AI Powered Data Platform | 320 | |
| SCAR: An AI-powered tool for ranking and filtering instruction-answer pairs base... | 315 | |
| Corpus analysis and processing toolkit | 247 | |
| The world's most complete, resilient, multilingual web crawler | 224 | |
| RAG data quality and ingestion framework - Acquire, Validate, Ingest | 211 | |
| Reaction fingerprint | 211 | |
| Remove duplicates and near-duplicates from text corpora, no matter the scale. | 207 | |
| Intelligent research paper analysis pipeline with LLM-driven categorization | 197 | |
| Safe, FPR-controlled, reproducible deduplication pipeline for bibliographic refe... | 189 | |
| Toolkit to enrich, validate and explore YAML metadata from a pandas DataFrame. | 187 |