In [1], we create NER datasets containing short, low-context sentences and queries, including LOWNER, MSQ-NER, ORCAS-NER and Gazetteers (1.67 million entities). See details here. This release contains the multilingual versions of these datasets, i.e. the data used in [2], where we create:
The dataset is stored in the public Amazon S3 bucket code-mixed-ner. See more in Open Data on AWS.
You will need to install the AWS Command Line Interface to access the dataset. For example, to download the dataset, use:
aws s3 cp s3://code-mixed-ner ./ --recursive --no-sign-request
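Since a recursive copy fetches everything in the bucket, it may help to inspect the contents first. The following is a minimal sketch, not part of the official instructions: the helper names `list_bucket` and `preview_download` and the `DEST` directory are hypothetical, while the bucket name and `--no-sign-request` come from the command above. It relies on the AWS CLI's `aws s3 ls` command and the `--dryrun` option of `aws s3 cp`.

```shell
# Bucket name is taken from the text above.
BUCKET="code-mixed-ner"
S3_URI="s3://${BUCKET}"
# Hypothetical local target directory (an assumption; override by setting DEST).
DEST="${DEST:-./code-mixed-ner}"

# List the top-level contents of the public bucket without AWS credentials.
list_bucket() {
    aws s3 ls "${S3_URI}/" --no-sign-request
}

# Print the copy operations that would run, without downloading anything.
preview_download() {
    aws s3 cp "${S3_URI}" "${DEST}" --recursive --no-sign-request --dryrun
}
```

After defining these functions in a shell session, run `list_bucket` to see what the bucket holds and `preview_download` to check which files would be copied. Dropping `--dryrun` from the copy command performs the actual download.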
[1] GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input. 2021. Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of NAACL.
[2] Gazetteer Enhanced Named Entity Recognition for Code-Mixed Web Queries. 2021. Besnik Fetahu, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of SIGIR.