PyPI Stats

Search Packages

Find Python packages by name, description, GitHub topic, or filter by metrics
tusharsadhwani
pytokens

A fast, spec-compliant Python 3.14+ tokenizer that runs on older Pythons.

72.4M 4 7
hplt-project
sacremoses

Python port of Moses tokenizer, truecaser and normalizer

2.6M 495 59
polm
fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

567K 518 39
taishi-i
nagisa

A Japanese tokenizer based on recurrent neural networks

231K 417 23
lovit
soynlp

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

128K 983 184
adbar
simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

90K 195 15
berkmancenter
sentence-splitter

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

61K 258 34
OpenNMT
pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

61K 333 82
mideind
tokenizer

A tokenizer for Icelandic text.

50K 30 8
natasha
natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

49K 1K 116
izikeros
count-tokens

Count tokens in a text file.

46K 13 0
ngocjr7
sctokenizer

A Source Code Tokenizer

32K 13 6
davidpirogov
toon-llm

Token-Oriented Object Notation (TOON) is an LLM-optimized data serialization format implemented in Python.

29K 9 3
ModelCloud
tokenicer

A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.

26K 11 4
OpenPecha
botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

22K 81 15
OpenVoiceOS
quebra-frases

chunks strings into byte sized pieces

13K 1 3
cahya-wirawan
pyrwkv-tokenizer

A fast RWKV Tokenizer written in Rust

13K 54 5
trag1c
crossandra

A fast and simple tokenization library for Python operating on enums and regular expressions, with a decent amount of configuration.

12K 9 1
naturalness
javac-parser

Exposes OpenJDK's Java parser and scanner to Python

11K 7 4
PyThaiNLP
nlpo3

Thai natural language processing library in Rust, with Python and Node bindings.

11K 44 13
lindera
lindera-python

A multilingual morphological analysis library.

10K 625 58
cereja-project
cereja

Cereja is a bundle of useful functions we don't want to rewrite and .. just pure fun!

9K 29 12
roshan-research
hazm

Persian NLP Toolkit

9K 1K 205
artitw
text2text

Text2Text Language Modeling Toolkit

6K 304 41
    • Data from PyPI, GitHub, ClickHouse, and BigQuery