49 dependents
| Package | Description | Downloads/month |
|---|---|---|
| | Data and tools for generating and inspecting OLMo pre-training data. | 44K |
| | Language detection using Spacy and Fasttext | 43K |
| | Open Source Neural Machine Translation and (Large) Language Models in PyTorch | 23K |
| | 80x faster and 95% accurate language identification with Fasttext | 3K |
| | A machine learning tool that ranks strings based on their relevance for malware ... | 3K |
| | Modern Data Centric AI system for Large Language Models | 3K |
| | S2AND | 2K |
| | Here we collected some online and offline models for text tagging. | 1K |
| | Targetted language identifier, based on FastText and Hunspell. | 1K |
| | Preprocessing and Extraction of Linguistic Information for Computational Analysi... | 1K |
| | Easy Data Preparation with latest LLMs-based Operators and Pipelines. | 1K |
| | Pre-filtering step for bicleaner | 1K |
| | Lint .po translation files for contamination, wrong languages, shifts, and garbl... | 1K |
| | A simple language detection library for short texts. | 899 |
| | Pipeline for querying and turning NASA's ADS publications metadata into curated,... | 890 |
| | ⚡️ Build Your Own chatgpt Bot|🧀 Discord/Slack/Kook/Telegram |⛓ ToolCall|🔖 Plugin... | 847 |
| | Open language modeling toolkit based on PyTorch | 752 |
| | Pamola Core library for data anonymization, privacy models, metrics, and utiliti... | 737 |
| | Blazing fast language detection using fastText model | 693 |
| | An open-source simplifies ETL workflow with Python based on Spark | 682 |
| | A SapientML plugin of preprocess CodeBlockGenerator | 643 |
| | Detect language of a given text, fast | 577 |
| | Saujana NLP for World Embedding | 564 |
| | Text Machina: Seamless Generation of Machine-Generated Text Datasets | 513 |
| | Here I collected some online and offline models for text tagging. | 444 |
| | Saujana NLP for World Embedding | 435 |
| | Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, Uzbek, Turkish, ... | 422 |
| | A tiny package (and standalone script) for downloading any pretrained fasttext w... | 402 |
| | A Python module that adds features to OpenLA data to make it easier to use for M... | 398 |
| | Robust Language Identification using an ensemble of 5-7 LID backends | 354 |
| | Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, plus tokenizer/g... | 249 |
| | A data processor package | 237 |
| | Detects the language of text | 226 |
| | LMOps Tool for Korean | 223 |
| | Framework for creating, running and validation of ML models on tabular data | 213 |
| | Detect quality of (digitized) text. | 207 |
| | An open-source simplifies ETL workflow with Python based on Spark | 193 |
| | Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaC... | 193 |
| | A data processor package | 173 |
| | Requirements Similarity tool for Software Product Lines | 163 |
| | SyGra - Graph-oriented Synthetic data generation Pipeline | 158 |
| | Text-tagging project within Yandex x HSE StudCamp event | 155 |
| | | 107 |
| | Advanced AI Optimization Toolkit | 87 |
| | A collection of utility functions for my projects. | 79 |
| | chatbot client for llm | 77 |
| | Detects and fixes AI word hallucinations in multilingual text | 69 |
| | Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool | 49 |
| | LLM Web Kit for processing web content | 16 |