EOS token is also padding token

by stephantulkens - opened Mar 10, 2025

edited Mar 10, 2025

Hello!

There's some weird behavior with the tokenizer. When I encode text with the tokenizer from HF, the output does not include an eos token, e.g.:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.encode("dogs")
# output: [128000, 18964]

The tokenizer seems to use <|end_of_text|> as the padding token, however:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.batch_encode_plus(["dogs", "many cats"], padding=True)
# output (input_ids only): [[128000, 81134, 128001], [128000, 35676, 19987]]

When we look at the special tokens, the <|end_of_text|> token is indeed stored as both pad_token and eos_token. Because it is also stored as the pad token, any eos tokens are truncated after encoding.

{'bos_token': '<|begin_of_text|>',
 'eos_token': '<|end_of_text|>',
 'pad_token': '<|end_of_text|>',
 'mask_token': '<|mask|>'}
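A quick way to spot this kind of overlap is to group the special-token roles by their token string. The sketch below uses a hypothetical helper, `shared_special_tokens`, which is not part of transformers. It assumes a mapping shaped like the one printed above (on a loaded tokenizer this would presumably be `tok.special_tokens_map`); here the mapping is copied by hand so the snippet runs without downloading anything.

```python
from collections import defaultdict


def shared_special_tokens(special_tokens_map):
    """Return {token: [roles]} for tokens that fill more than one role."""
    roles_by_token = defaultdict(list)
    for role, token in special_tokens_map.items():
        # Skip list-valued entries such as additional special tokens.
        if not isinstance(token, str):
            continue
        roles_by_token[token].append(role)
    return {
        token: sorted(roles)
        for token, roles in roles_by_token.items()
        if len(roles) > 1
    }


# Copied by hand from the mapping above (assumption: not fetched from the Hub).
special_tokens = {
    "bos_token": "<|begin_of_text|>",
    "eos_token": "<|end_of_text|>",
    "pad_token": "<|end_of_text|>",
    "mask_token": "<|mask|>",
}

overlaps = shared_special_tokens(special_tokens)
print(overlaps)
# {'<|end_of_text|>': ['eos_token', 'pad_token']}
```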

A cursory look at the other tokens suggests there is no dedicated padding token in the vocabulary.
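One practical consequence of a shared ID: padding cannot be told apart from a genuine eos token by comparing against the pad ID alone, so the attention mask is the safer signal. The following is a minimal sketch with a hypothetical helper (`real_token_ids`) and a hand-built batch in which an eos token has been appended manually; the ID 128001 is taken from the padded output above, and the mask values are assumptions for illustration.

```python
def real_token_ids(input_ids, attention_mask):
    """Drop padded positions using the mask, not a comparison with the pad ID."""
    return [
        [tok_id for tok_id, keep in zip(ids, mask) if keep]
        for ids, mask in zip(input_ids, attention_mask)
    ]


PAD_ID = EOS_ID = 128001  # same ID for both roles, as in the output above

# Hypothetical batch: row 0 ends in a genuine eos followed by one pad position.
input_ids = [
    [128000, 18964, EOS_ID, PAD_ID],
    [128000, 35676, 19987, EOS_ID],
]
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

by_mask = real_token_ids(input_ids, attention_mask)
by_id = [[t for t in ids if t != PAD_ID] for ids in input_ids]

print(by_mask)  # [[128000, 18964, 128001], [128000, 35676, 19987, 128001]]
print(by_id)    # [[128000, 18964], [128000, 35676, 19987]]
```

Filtering by ID silently drops the real eos tokens along with the padding, while filtering by mask keeps them.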

When using the bare tokenizer (the backend model), every instance is padded to length 512 with <|end_of_text|> tokens. This is in the config (it has a fixed padding strategy with 512 tokens), but it looked a little bit weird to me.

So I'm just here to confirm whether this is intended, or whether the tokenizer should have a dedicated padding token which went missing. Thanks!

DuarteMRAlves

EuroBERT org Mar 10, 2025

Hello!

Thank you for your interest!

Indeed, you can use the eos token as padding; this is what we have been doing during fine-tuning.
Regarding the default padding value being 512, that was a misconfiguration in the config file. I have updated it to remove that information.

It should be fixed if you redownload the tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", force_download=True)
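To check whether a fixed padding length is still configured after redownloading, one option is to look at the backend tokenizer's `padding` property. The sketch below uses a hypothetical helper, `fixed_padding_length`, and exercises it on stand-in objects so it runs offline. It assumes `padding` is either `None` or a dict with a `"length"` entry, which is how the `tokenizers` library appears to expose it; the stub values are illustrative, not read from the actual config.

```python
from types import SimpleNamespace


def fixed_padding_length(backend_tokenizer):
    """Return the configured fixed padding length, or None if there is none."""
    padding = backend_tokenizer.padding  # assumed: dict of padding params, or None
    if padding is None:
        return None
    return padding.get("length")


# Stand-ins for a backend tokenizer before and after the config change
# (assumption: the real objects expose the same `padding` attribute).
before_fix = SimpleNamespace(padding={"length": 512, "pad_token": "<|end_of_text|>"})
after_fix = SimpleNamespace(padding=None)

print(fixed_padding_length(before_fix), fixed_padding_length(after_fix))
# 512 None
```

With the real tokenizer, the call would presumably be `fixed_padding_length(tokenizer.backend_tokenizer)`.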

Let me know if this fixes the issue for you!

Nicolas-BZRD changed discussion status to closed Mar 14, 2025
