PyPI Stats

Tokenizer Python Packages

Python packages tagged with the GitHub topic "tokenizer". Sorted by relevance, with stars and monthly downloads.
tusharsadhwani
pytokens

A fast, spec-compliant Python 3.14+ tokenizer that runs on older Pythons.

73.3M 4 7
hplt-project
sacremoses

Python port of Moses tokenizer, truecaser and normalizer

2.6M 495 59
polm
fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

567K 518 39
taishi-i
nagisa

A Japanese tokenizer based on recurrent neural networks

228K 417 23
lovit
soynlp

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

129K 983 184
adbar
simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

94K 195 15
OpenNMT
pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

62K 333 82
berkmancenter
sentence-splitter

Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder.

58K 258 34
mideind
tokenizer

A tokenizer for Icelandic text.

52K 30 8
natasha
natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

49K 1K 116
izikeros
count-tokens

Count tokens in a text file.

45K 13 0
ngocjr7
sctokenizer

A Source Code Tokenizer

36K 13 6
davidpirogov
toon-llm

Token-Oriented Object Notation (TOON) is an LLM-optimized data serialization format implemented in Python.

28K 9 3
ModelCloud
tokenicer

A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.

26K 11 4
OpenPecha
botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python

23K 81 15
cahya-wirawan
pyrwkv-tokenizer

A fast RWKV Tokenizer written in Rust

13K 54 5
OpenVoiceOS
quebra-frases

chunks strings into byte sized pieces

12K 1 3
trag1c
crossandra

A fast and simple tokenization library for Python operating on enums and regular expressions, with a decent amount of configuration.

11K 9 1
PyThaiNLP
nlpo3

Thai natural language processing library in Rust, with Python and Node bindings.

11K 44 13
naturalness
javac-parser

Exposes OpenJDK's Java parser and scanner to Python

10K 7 4
cereja-project
cereja

Cereja is a bundle of useful functions we don't want to rewrite and .. just pure fun!

10K 29 12
roshan-research
hazm

Persian NLP Toolkit

9K 1K 205
lindera
lindera-python

A multilingual morphological analysis library.

8K 625 58
krahd
modelito

Lightweight Python abstractions and connectors for LLM providers (OpenAI, Claude, Gemini, Ollama).

6K 0 0
Data from PyPI, GitHub, ClickHouse, and BigQuery