WikiNEuRal Dataset

DOI: 10.18653/v1/2021.findings-emnlp.215
Contributors: Simone Tedeschi, Valentino Maiorca, Niccolò Campolungo, Francesco Cecconi, Roberto Navigli
Datarows: 1,004,730 text strings

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

In a nutshell, WikiNEuRal is built with a novel technique that combines a multilingual lexical knowledge base (BabelNet) with transformer-based architectures (BERT) to produce high-quality annotations for multilingual NER. On common NER benchmarks, it shows consistent improvements of up to 6 span-based F1 points over state-of-the-art alternative data production methods. We used this methodology to automatically generate training data for NER in 9 languages. Learn more about the dataset in the paper below.
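
To make the "span-based F1" metric concrete, here is a minimal Python sketch of how such a score is commonly computed: an entity counts as correct only if its boundaries and type both match the gold annotation exactly. The BIO tag scheme, the label names (`PER`, `LOC`, `ORG`) and the helper names `extract_spans` and `span_f1` are assumptions made for illustration; this is not the evaluation script used in the paper.

```python
from typing import List, Set, Tuple

# (sentence index, start token, end token exclusive, entity type)
Span = Tuple[int, int, int, str]


def extract_spans(tags: List[str], sent_id: int = 0) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((sent_id, start, i, label))
        start, label = None, None
        # A stray "I-" tag is leniently treated as the start of a new span.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match spans."""
    gold_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= extract_spans(g, sent_id)
        pred_spans |= extract_spans(p, sent_id)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "B-ORG"]]  # right boundaries, wrong type on the last entity
    print(span_f1(gold, pred))
```

In this toy case one of the two predicted spans matches the gold annotation, so precision, recall and F1 all come out at 0.5. An improvement of "6 F1 points" corresponds to a difference of 0.06 on this 0 to 1 scale.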

[WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER](https://aclanthology.org/2021.findings-emnlp.215) (Tedeschi et al., Findings 2021)
CC BY-NC-SA 3.0