OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
Abstract
OpenNER 1.0 is a standardized collection of named entity recognition datasets across multiple languages and ontologies, enabling research with multilingual and multi-ontology models.
We present OpenNER 1.0, a standardized collection of openly available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner.
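To make the idea of "consistent entity type names across corpora" concrete, here is a minimal sketch of renaming entity types in BIO-tagged sequences. It is an illustration only, not the OpenNER pipeline: the `TYPE_MAP` entries, the function names, and the choice of BIO tagging are all assumptions and are not taken from the paper or its repository.

```python
# Hypothetical mapping from corpus-specific labels to shared type names.
# These entries are illustrative and are NOT taken from OpenNER.
TYPE_MAP = {
    "PERSON": "PER",
    "PER": "PER",
    "LOCATION": "LOC",
    "LOC": "LOC",
    "ORGANIZATION": "ORG",
    "ORG": "ORG",
}


def standardize_tag(tag: str, type_map: dict[str, str]) -> str:
    """Rename the entity type in one BIO tag, keeping the B-/I- prefix."""
    if tag == "O":
        return tag
    prefix, sep, entity_type = tag.partition("-")
    if not sep or prefix not in {"B", "I"}:
        raise ValueError(f"not a BIO tag: {tag!r}")
    # Lookup is case-insensitive; unknown types pass through unchanged
    # so that no annotation is silently dropped.
    return f"{prefix}-{type_map.get(entity_type.upper(), entity_type)}"


def standardize_tags(
    tags: list[str], type_map: dict[str, str] = TYPE_MAP
) -> list[str]:
    """Apply standardize_tag to every tag in a sentence."""
    return [standardize_tag(t, type_map) for t in tags]


if __name__ == "__main__":
    print(standardize_tags(["B-PERSON", "I-PERSON", "O", "B-Location"]))
```

Passing unknown types through unchanged is a deliberate choice in this sketch: a multi-ontology collection may keep types that have no counterpart in other corpora.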
Community
Thanks for asking, and sorry about there not being a dataset right now. We had to preprint this a little ahead of time so it can be cited by another paper, so the paper is slightly ahead of the data. We're going to put up a new preprint next week along with the dataset release and a GitHub repo.
Hello, will you still send this dataset? This dataset will be of great help to me.
Hi @stefan-it and @janeenmadena, I wanted to let you know we've updated the preprint with a new set of LLM results and a number of other improvements. We had originally posted the preprint earlier than would be ideal because we needed another submission under review to cite it. We are still making improvements to OpenNER based on reviewer feedback, so we are waiting until it's accepted for publication to share the data. Reviewers are pointing out a lot of important things that we want to address before we do a full public release.
Hi @stefan-it and @janeenmadena, our releases are here on HF:
https://huggingface.co/datasets/bltlab/open-ner-standardized
https://huggingface.co/datasets/bltlab/open-ner-core-types
Also on GitHub at:
https://github.com/bltlab/open-ner