49 dependents
Package Description Downloads/month
Data and tools for generating and inspecting OLMo pre-training data. 44K
Language detection using Spacy and Fasttext 43K
Open Source Neural Machine Translation and (Large) Language Models in PyTorch 23K
80x faster and 95% accurate language identification with Fasttext 3K
A machine learning tool that ranks strings based on their relevance for malware ... 3K
Modern Data Centric AI system for Large Language Models 3K
S2AND 2K
Here we collected some online and offline models for text tagging. 1K
Targeted language identifier, based on FastText and Hunspell. 1K
Preprocessing and Extraction of Linguistic Information for Computational Analysi... 1K
Easy Data Preparation with latest LLMs-based Operators and Pipelines. 1K
Pre-filtering step for bicleaner 1K
Lint .po translation files for contamination, wrong languages, shifts, and garbl... 1K
A simple language detection library for short texts. 899
Pipeline for querying and turning NASA's ADS publications metadata into curated,... 890
⚡️ Build Your Own chatgpt Bot|🧀 Discord/Slack/Kook/Telegram |⛓ ToolCall|🔖 Plugin... 847
Open language modeling toolkit based on PyTorch 752
Pamola Core library for data anonymization, privacy models, metrics, and utiliti... 737
Blazing fast language detection using fastText model 693
An open-source tool that simplifies ETL workflows with Python, based on Spark 682
A SapientML plugin of preprocess CodeBlockGenerator 643
Detect language of a given text, fast 577
Saujana NLP for World Embedding 564
Text Machina: Seamless Generation of Machine-Generated Text Datasets 513
Here I collected some online and offline models for text tagging. 444
Saujana NLP for World Embedding 435
Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, Uzbek, Turkish, ... 422
A tiny package (and standalone script) for downloading any pretrained fasttext w... 402
A Python module that adds features to OpenLA data to make it easier to use for M... 398
Robust Language Identification using an ensemble of 5-7 LID backends 354
Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, plus tokenizer/g... 249
A data processor package 237
Detects the language of text 226
LMOps Tool for Korean 223
Framework for creating, running, and validating ML models on tabular data 213
Detect quality of (digitized) text. 207
An open-source tool that simplifies ETL workflows with Python, based on Spark 193
Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaC... 193
A data processor package 173
Requirements Similarity tool for Software Product Lines 163
SyGra - Graph-oriented Synthetic data generation Pipeline 158
Text-tagging project within Yandex x HSE StudCamp event 155
107
Advanced AI Optimization Toolkit 87
A collection of utility functions for my projects. 79
Chatbot client for LLMs 77
Detects and fixes AI word hallucinations in multilingual text 69
Dingo: A Comprehensive AI Data, Model and Application Quality Evaluation Tool 49
LLM Web Kit for processing web content 16