EOS token is also padding token
Hello!
There's some weird behavior with the tokenizer. When encoding text using the tokenizer from HF, it does not include an eos token, e.g.:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.encode("dogs")
# output: [128000, 18964]
However, the tokenizer seems to use <|end_of_text|> as the padding token:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m")
tok.batch_encode_plus(["dogs", "many cats"], padding=True)
# output (input_ids): [[128000, 81134, 128001], [128000, 35676, 19987]]
When we look at the special tokens, <|end_of_text|> is indeed stored as both the pad_token and the eos_token. Because it doubles as the pad token, any eos tokens are truncated after encoding.
{'bos_token': '<|begin_of_text|>',
'eos_token': '<|end_of_text|>',
'pad_token': '<|end_of_text|>',
'mask_token': '<|mask|>'}
A cursory look at the other tokens showed that there doesn't seem to be a dedicated padding token in the vocabulary.
When using the bare tokenizer (the backend model), every instance is padded to length 512 with <|end_of_text|> tokens. This is in the config (it has a fixed padding strategy with 512 tokens), but it looked a little bit weird to me.
So I'm just here to confirm whether this is intended, or whether the tokenizer should have a dedicated padding token which went missing. Thanks!
Hello!
Thank you for your interest!
Indeed, you can use the eos token as the padding token; this is what we did during fine-tuning.
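To illustrate why reusing the eos token for padding is usually harmless, here is a minimal pure-Python sketch of batch padding. It is not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the ids are simply the ones from the batch output quoted earlier in this thread. The point is that padded positions get a 0 in the attention mask, so the model can ignore them whichever id fills them.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the attention mask.

    Hypothetical helper for illustration only; it is not part of transformers.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Assumption: 128001 is the id of <|end_of_text|>, as in the output quoted above.
EOS_ID = 128001
batch = pad_batch([[128000, 81134], [128000, 35676, 19987]], pad_id=EOS_ID)
print(batch["input_ids"])
print(batch["attention_mask"])
```

When fine-tuning, it is the attention mask (and, where applicable, the loss mask) that should tell padding apart from real tokens, not the token id itself.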
Regarding the default padding value being 512, that was a misconfiguration in the config file. I have updated it to remove that information.
It should be fixed if you redownload the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("EuroBERT/EuroBERT-210m", force_download=True)
Let me know if this fixes the issue for you!