Multilingual Named Entity Recognition (NER) Datasets with Gazetteer

Description

In [1], we created NER datasets of short, low-context sentences and queries, including LOWNER, MSQ-NER, ORCAS-NER, and Gazetteers (1.67 million entities). See details here. This release contains the multilingual versions of these datasets, i.e., the data used in [2], where we created:

  1. A multilingual LOWNER, i.e. mLOWNER
  2. A multilingual ORCAS-NER, i.e. EMBER
  3. Multilingual Gazetteers

License

CC BY 4.0

How to Download

The dataset is stored in the public Amazon S3 bucket code-mixed-ner. See more in Open Data on AWS.

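You will need to install the AWS Command Line Interface (AWS CLI) to access the dataset. To download the entire dataset into the current directory (the --no-sign-request flag allows anonymous access, so no AWS credentials are needed), run: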

aws s3 cp s3://code-mixed-ner ./ --recursive --no-sign-request
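
Because a recursive copy fetches every object in the bucket, it can help to inspect the contents first and to download into a dedicated directory rather than the current one. The following is a minimal sketch, not an official script: the bucket name comes from the text above, while the DEST directory and the RUN switch are hypothetical names introduced only for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Bucket name is taken from the text above.
BUCKET="s3://code-mixed-ner"
# Hypothetical local target directory (an assumption, not part of the release).
DEST="./code-mixed-ner"

# List every object with sizes and a total, without AWS credentials.
LIST_CMD="aws s3 ls $BUCKET/ --recursive --human-readable --summarize --no-sign-request"
# Copy the whole bucket into DEST, without AWS credentials.
COPY_CMD="aws s3 cp $BUCKET $DEST --recursive --no-sign-request"

# RUN is a hypothetical switch: RUN=1 executes, anything else only prints.
# The unquoted expansions rely on word splitting, so DEST must not contain spaces.
if [ "${RUN:-0}" = "1" ]; then
  mkdir -p "$DEST"
  $LIST_CMD
  $COPY_CMD
else
  echo "$LIST_CMD"
  echo "$COPY_CMD"
fi
```

Saved under a name of your choice, the script prints both commands when run as is; setting RUN=1 in the environment lists the bucket and then performs the download.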

References

  1. GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input. 2021. Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of NAACL.

  2. Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries. 2021. Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of SIGIR.