HuggingFace Tokenizers - Fast Tokenization for NLP
Fast, production-ready tokenizers with Rust performance and Python ease-of-use.
When to use HuggingFace Tokenizers
Use HuggingFace Tokenizers when:
- Need extremely fast tokenization (<20s per GB of text)
- Training custom tokenizers from scratch
- Want alignment tracking (token → original text position)
- Building production NLP pipelines
- Need to tokenize large corpora efficiently
Performance:
- Speed: <20 seconds to tokenize 1GB on CPU
- Implementation: Rust core with Python/Node.js bindings
- Efficiency: 10-100× faster than pure Python implementations
Use alternatives instead:
- SentencePiece: Language-independent, used by T5/ALBERT
- tiktoken: OpenAI's BPE tokenizer for GPT models
- transformers AutoTokenizer: Loading pretrained only (uses this library internally)
Quick start
Installation
# Install tokenizers
pip install tokenizers
# With transformers integration
pip install tokenizers transformers
Load pretrained tokenizer
from tokenizers import Tokenizer
# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Encode text (the pretrained BERT pipeline adds [CLS] and [SEP] by default)
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']
print(output.ids)     # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# Decode back (special tokens are skipped by default)
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
Train custom BPE tokenizer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
)
# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)
# Save
tokenizer.save("my-tokenizer.json")
Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB
Batch encoding with padding
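# Enable padding (look up the [PAD] id in the vocabulary rather than hard-coding it)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)
for encoding in encodings:
    print(encoding.ids)
# With the pretrained bert-base-uncased tokenizer, where [PAD] has id 0:
# [101, 7592, 2088, 102, 0, 0, 0]
# [101, 2023, 2003, 1037, 2936, 6251, 102]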
Tokenization algorithms
BPE (Byte-Pair Encoding)
How it works:
- Start with character-level vocabulary
- Find most frequent character pair
- Merge into new token, add to vocabulary
- Repeat until vocabulary size reached
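The steps above can be sketched in plain Python. This is a toy illustration, not the library's Rust implementation: the function name `learn_bpe`, the word counts, and the lexicographic tie-break between equally frequent pairs are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping (toy sketch)."""
    # Every word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```

The ordered merge list is what a trained BPE model replays at encoding time, which is why tokenization depends on merge order.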
Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Handles OOV words well (breaks into subwords)
- Flexible vocabulary size
- Good for morphologically rich languages
Trade-offs:
- Tokenization depends on merge order
- May split common words unexpectedly
WordPiece
How it works:
- Start with character vocabulary
- Score merge pairs: frequency(pair) / (frequency(first) × frequency(second))
- Merge highest scoring pair
- Repeat until vocabulary size reached
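To make the scoring rule concrete, here is a small sketch that computes the score for every adjacent pair in a toy corpus. The helper name `wordpiece_scores` and the word counts are illustrative assumptions; the library's trainer is implemented differently.

```python
from collections import Counter

def wordpiece_scores(word_freqs):
    """Score adjacent pairs as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        # Non-initial characters carry the "##" continuation prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    scores = {
        pair: count / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, count in pair_freq.items()
    }
    return scores, pair_freq

scores, pair_freq = wordpiece_scores({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5})
best = max(scores, key=scores.get)
most_frequent = max(pair_freq, key=pair_freq.get)
print(best)           # ('##g', '##s')
print(most_frequent)  # ('##u', '##g')
```

In this toy corpus the highest-scoring pair is not the most frequent one: dividing by the frequencies of the parts favors pairs whose parts rarely appear apart, whereas plain BPE would pick the most frequent pair.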
Used by: BERT, DistilBERT, MobileBERT
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##",
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
Advantages:
- Prioritizes pairs whose parts rarely occur apart, rather than pairs that are merely frequent
- Used successfully in BERT (state-of-the-art results)
Trade-offs:
- A word becomes [UNK] if it cannot be fully matched against subwords in the vocabulary
- Saves only the vocabulary, not merge rules, so encoding relies on greedy longest-match lookup
Unigram
How it works:
- Start with large vocabulary (all substrings)
- Compute loss for corpus with current vocabulary
- Remove tokens with minimal impact on loss
- Repeat until vocabulary size reached
Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
)
tokenizer.train(files=["data.txt"], trainer=trainer)
Advantages:
- Probabilistic (finds most likely tokenization)
- Works well for languages without word boundaries
- Handles diverse linguistic contexts
Trade-offs:
- Computationally expensive to train
- More hyperparameters to tune
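"Finds most likely tokenization" can be illustrated with a small Viterbi search over a unigram vocabulary. This is a sketch under stated assumptions: the function `viterbi_segment` and the token probabilities are made up for the example and are not taken from any trained model.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-probability segmentation of `text` and its log-probability."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last token in text[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1], best_score[n]

# Hypothetical token probabilities, chosen only for illustration.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05, "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}
log_probs = {token: math.log(p) for token, p in probs.items()}
tokens, score = viterbi_segment("hugs", log_probs)
print(tokens)  # ['hug', 's']
```

Here "hug" + "s" (0.3 × 0.05) beats "hu" + "gs" (0.1 × 0.05) and "h" + "ug" + "s" (0.05 × 0.2 × 0.05), so it is returned as the most likely segmentation.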
Tokenization pipeline
Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
Normalization
Clean and standardize text:
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence
tokenizer.normalizer = Sequence([
    NFD(),           # Unicode normalization (decompose)
    Lowercase(),     # Convert to lowercase
    StripAccents(),  # Remove accents
])
# Input: "Héllo WORLD"
# After normalization: "hello world"
Common normalizers:
- NFD, NFC, NFKD, NFKC: Unicode normalization forms
- Lowercase(): Convert to lowercase
- StripAccents(): Remove accents (é → e)
- Strip(): Remove leading and trailing whitespace
- Replace(pattern, content): String or regex replacement
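For intuition about what the NFD → Lowercase → StripAccents sequence does, the same effect can be approximated with Python's standard `unicodedata` module. This is only an approximation for illustration; the helper name `normalize` is an assumption and the library's normalizers may treat some characters differently.

```python
import unicodedata

def normalize(text):
    """Approximate NFD -> Lowercase -> StripAccents using only the standard library."""
    # NFD splits "é" into "e" followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop nonspacing combining marks (Unicode category "Mn"), i.e. the accents.
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")

result = normalize("Héllo WORLD")
print(result)  # hello world
```

The order matters: accents can only be stripped as separate characters after decomposition, which is why NFD comes first in the sequence.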
Pre-tokenization
Split text into word-like units:
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel
# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation(),
])
# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
Common pre-tokenizers:
- Whitespace(): Split on word boundaries (regex \w+|[^\w\s]+), dropping spaces, tabs, and newlines
- ByteLevel(): GPT-2 style byte-level splitting
- Punctuation(): Isolate punctuation
- Digits(individual_digits=True): Split digits individually
- Metaspace(): Replace spaces with ▁ (SentencePiece style)
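The word-and-punctuation split shown above can be mimicked with a regular expression, which also makes it easy to see where the offsets come from. This is a standard-library sketch; the function name `pre_tokenize` is an assumption and this is not the library's own code path.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) tuples, similar in spirit to a whitespace pre-tokenizer."""
    return [(m.group(), (m.start(), m.end())) for m in WORD_OR_PUNCT.finditer(text)]

pieces = pre_tokenize("Hello, world!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

Keeping the character offsets at this stage is what later allows tokens to be mapped back to positions in the original text.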
Post-processing
Add special tokens for model input:
from tokenizers.processors import TemplateProcessing
# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
Common patterns:
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)],
)
# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)],
)
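To show what a template does to the token ids, here is a simplified sketch that expands a template string by hand. The function `apply_template` is hypothetical, it handles only the `$A`/`$B` placeholders and an optional `:N` type-id suffix, and defaulting the type id to 0 is an assumption of this sketch.

```python
def apply_template(template, special_ids, a_ids, b_ids=None):
    """Expand a template such as "[CLS] $A [SEP]" into (ids, type_ids)."""
    sequences = {"$A": a_ids, "$B": b_ids}
    ids, type_ids = [], []
    for part in template.split():
        # An optional ":N" suffix sets the type id; it defaults to 0 in this sketch.
        name, _, suffix = part.partition(":")
        type_id = int(suffix) if suffix else 0
        if name in sequences:
            if sequences[name] is None:
                raise ValueError(f"template uses {name} but no ids were given")
            piece = sequences[name]
        else:
            piece = [special_ids[name]]  # a special token contributes exactly one id
        ids.extend(piece)
        type_ids.extend([type_id] * len(piece))
    return ids, type_ids

special = {"[CLS]": 1, "[SEP]": 2}
single_ids, _ = apply_template("[CLS] $A [SEP]", special, [10, 11])
pair_ids, pair_types = apply_template(
    "[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", special, [10, 11], [20]
)
print(single_ids)  # [1, 10, 11, 2]
print(pair_ids)    # [1, 10, 11, 2, 20, 2]
print(pair_types)  # [0, 0, 0, 0, 1, 1]
```

The ids 10, 11, and 20 stand in for whatever the model produced for each sentence; only the special-token ids come from the template configuration.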
Alignment tracking
Track token positions in original text:
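text = "Hello, world!"
# Skip special tokens so every token maps to a span of the input
output = tokenizer.encode(text, add_special_tokens=False)
# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")
# Output (with a lowercasing tokenizer such as bert-base-uncased):
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'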
Use cases:
- Named entity recognition (map predictions back to text)
- Question answering (extract answer spans)
- Token classification (align labels to original positions)
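For use cases like these, the usual step is to convert a character span into token indices using the offsets. The sketch below assumes offsets in the same (start, end) form as printed above; the helper name `tokens_for_span` is hypothetical.

```python
def tokens_for_span(offsets, span_start, span_end):
    """Return indices of tokens whose [start, end) range overlaps [span_start, span_end)."""
    return [
        index
        for index, (start, end) in enumerate(offsets)
        if start < span_end and end > span_start
    ]

text = "Hello, world!"
# Offsets as printed in the alignment example above.
offsets = [(0, 5), (5, 6), (7, 12), (12, 13)]

entity_start = text.index("world")
entity_end = entity_start + len("world")
token_indices = tokens_for_span(offsets, entity_start, entity_end)
print(token_indices)  # [2]
```

Empty offsets such as (0, 0) never satisfy the overlap test, so tokens without a source span would be skipped by this check.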
Integration with transformers
Load with AutoTokenizer
from transformers import AutoTokenizer
# AutoTokenizer automatically uses fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if using fast tokenizer
print(tokenizer.is_fast)  # True
# Access underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
Convert custom tokenizer to transformers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast
# Train custom tokenizer
tokenizer = Tokenizer(BPE())
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")
# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
Common patterns
Train from iterator (large datasets)
from datasets import load_dataset
# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]
# Train tokenizer (tokenizer and trainer configured as in the training example above)
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset),  # For progress bar
)
Performance: Processes 1GB in ~10-20 minutes
Enable truncation and padding
# Enable truncation
tokenizer.enable_truncation(max_length=512)
# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512,  # Fixed length, or None for batch max
)
# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
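The combined effect of truncation and fixed-length padding on a list of ids can be sketched as follows. This simplified helper (`truncate_and_pad`, a hypothetical name) ignores special tokens and truncation strategies, which the library handles itself.

```python
def truncate_and_pad(ids, max_length, pad_id):
    """Cut `ids` to `max_length`, then right-pad with `pad_id`; also return an attention mask."""
    kept = ids[:max_length]
    padding = max_length - len(kept)
    attention_mask = [1] * len(kept) + [0] * padding
    return kept + [pad_id] * padding, attention_mask

short_ids, short_mask = truncate_and_pad([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = truncate_and_pad(list(range(10)), max_length=5, pad_id=0)
print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [0, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

Either way the result has exactly `max_length` entries, which is why the encoded output above has length 512 regardless of the input length.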
Multi-processing
from multiprocessing import Pool
from tokenizers import Tokenizer
# Load tokenizer at module level so each worker process has its own copy
tokenizer = Tokenizer.from_file("tokenizer.json")
def encode_batch(texts):
    return tokenizer.encode_batch(texts)
if __name__ == "__main__":
    # corpus is assumed to be a list of strings loaded elsewhere
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]
    # Process large corpus in parallel
    with Pool(8) as pool:
        # Encode in parallel
        results = pool.map(encode_batch, chunks)
Speedup: 5-8× with 8 cores
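The chunk-then-map pattern can be checked without a tokenizer or a process pool. In this sketch `fake_encode_batch` is a hypothetical stand-in for `tokenizer.encode_batch`, and the built-in `map` stands in for `pool.map`; both preserve chunk order, so flattening the results gives one output per input in the original order.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements each."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def fake_encode_batch(texts):
    # Hypothetical stand-in for tokenizer.encode_batch: counts whitespace-separated words.
    return [len(text.split()) for text in texts]

corpus = [f"sentence number {i}" for i in range(2500)]
chunks = list(chunked(corpus, 1000))

# map() keeps chunk order, as pool.map() does, so results line up with the corpus.
results = [length for batch in map(fake_encode_batch, chunks) for length in batch]
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
print(len(results))                      # 2500
```

Larger chunks reduce inter-process overhead, while smaller chunks balance load better across workers; 1000 is simply the value used in the example above.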
Performance benchmarks
Training speed
Corpus size    BPE (30k vocab)    WordPiece (30k vocab)    Unigram (8k vocab)
10 MB          15 sec             18 sec                   25 sec
100 MB         1.5 min            2 min                    4 min
1 GB           15 min             20 min                   40 min
Hardware: 16-core CPU, tested on English Wikipedia
Tokenization speed
Implementation    Time for 1 GB corpus    Throughput
Pure Python       ~20 minutes             ~50 MB/min
HF Tokenizers     ~15 seconds             ~4 GB/min
Speedup           80×                     80×
Test: English text, average sentence length 20 words
Memory usage
Task                     Memory
Load tokenizer           ~10 MB
Train BPE (30k vocab)    ~200 MB
Encode 1M sentences      ~500 MB
Supported models
Pre-trained tokenizers available via from_pretrained():
BERT family:
- bert-base-uncased, bert-large-cased
- distilbert-base-uncased
- roberta-base, roberta-large
GPT family:
- gpt2, gpt2-medium, gpt2-large
- distilgpt2
T5 family:
- t5-small, t5-base, t5-large
- google/flan-t5-xxl
Other:
- facebook/bart-base, facebook/mbart-large-cc25
- albert-base-v2, albert-xlarge-v2
- xlm-roberta-base, xlm-roberta-large
Browse all: https://huggingface.co/models?library=tokenizers
References
- Training Guide - Train custom tokenizers, configure trainers, handle large datasets
- Algorithms Deep Dive - BPE, WordPiece, Unigram explained in detail
- Pipeline Components - Normalizers, pre-tokenizers, post-processors, decoders
- Transformers Integration - AutoTokenizer, PreTrainedTokenizerFast, special tokens
Resources
- GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
- Version: 0.20.0+
- Paper: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)