Introduction


Named Entity Recognition (NER) refers to the automatic identification of named entities in a given text document. Named entities such as person names, organization names, location names, and product names are identified and tagged. Identification of named entities is important in several higher-level language technology systems, such as information extraction, machine translation, and cross-lingual information access systems.

Over the past decade, Indian language content on media such as websites, blogs, email, and chats has increased significantly. This growth is driven by people from non-metros and small cities. This huge volume of data needs to be processed automatically; companies in particular are interested in ascertaining public views on their products and processes. This requires natural language processing systems that identify entities and the associations or relations between them. Hence an automatic named entity recognizer is required.

The objectives of this evaluation exercise are:

  • Create benchmark data for the evaluation of Named Entity Recognition (NER) for Indian languages.
  • Encourage researchers to develop NER systems for Indian languages.

Challenges in Indian Language NER

  • Indian languages belong to several language families, the major ones being the Indo-Aryan branch of the Indo-European family and the Dravidian family.
  • The challenges in NER arise from several factors. Some of the main ones are listed below:
    1. Morphological richness - identifying the root of a word is difficult and requires the use of morphological analysers.
    2. No capitalization - in English, capitalization is one of the main features for identifying named entities, whereas Indian languages have no such feature.
    3. Ambiguity - common and proper nouns can be ambiguous. For example, "Roja" is both a common word meaning rose (the flower) and a person's name.
    4. Spelling variations - in web data, different people spell the same entity differently. For example, the Tamil person name Roja is spelt both "rosa" and "roja".
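
The spelling-variation problem in the last item can be illustrated with a small sketch. It is not a method used in this evaluation exercise; it simply assumes romanized spellings and uses Python's standard `difflib` string similarity, with a hypothetical threshold of 0.7, to group likely variants of the same name.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Return True when two romanized spellings are close enough to be variants.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_variants(names, threshold: float = 0.7):
    """Greedily group spellings: each name joins the first group whose
    first member it resembles, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


if __name__ == "__main__":
    # "roja", "rosa" and "Roja" end up in one group; "chennai" in another.
    print(group_variants(["roja", "rosa", "Roja", "chennai"]))
```

A surface-similarity check like this only addresses spelling variation; it does nothing for the common-noun versus proper-noun ambiguity described in item 3, which needs context.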



NER Annotated Corpus


The FIRE NER 2013 evaluation exercise is over. Any researcher interested in obtaining the annotated corpus for research may request it by sending an email to sobha@au-kbc.org.
The request should include full details, such as a description of the research work, affiliation details, and the languages being worked on.

The NER corpus can be downloaded for the following languages:

  • English
  • Hindi
  • Tamil
  • Malayalam
  • Bengali


The whole corpus is provided. Researchers may perform an n-fold experiment by partitioning the corpus accordingly. The corpus is protected; participants will be provided with an access code after registering by email as described above.
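
As a rough illustration of the n-fold partitioning mentioned above, the following sketch splits a corpus into n folds and yields one train/test pair per fold. It assumes the corpus has already been loaded as a list of sentences; the function name `n_fold_splits`, the default of 5 folds, and the fixed random seed are choices made for this example and are not prescribed by the organizers.

```python
import random


def n_fold_splits(sentences, n=5, seed=0):
    """Yield (train, test) pairs for an n-fold experiment.

    `sentences` is assumed to be a list of annotated sentences;
    the actual corpus file format is not modelled here.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    folds = [items[i::n] for i in range(n)]  # n folds of near-equal size
    for i in range(n):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


if __name__ == "__main__":
    corpus = [f"sentence-{k}" for k in range(10)]  # placeholder data
    for fold_no, (train, test) in enumerate(n_fold_splits(corpus, n=5), start=1):
        print(fold_no, len(train), len(test))
```

Splitting at the sentence (or document) level rather than the token level keeps each annotated entity wholly inside either the training or the test portion.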

Evaluation & Results

The evaluation metrics used were Precision, Recall and F-measure. Eight teams registered, of which five submitted runs, for a total of 9 submissions. The teams that submitted runs are:

  1. Systems Research Lab, Tata Research Development and Design Centre (TRDDC)
  2. Indian School of Mines, Dhanbad (ISM Dhanbad)
  3. Indian Statistical Institute, Kolkata (ISI Kolkata)
  4. CFILT Lab, Indian Institute of Technology Bombay (IIT-B)
  5. Malaviya National Institute of Technology (MNIT)

Results

Language   Team          System ID   Precision   Recall   F-Measure
Bengali    ISI Kolkata   Sys 1       23.69       28.02    25.68
Bengali    ISI Kolkata   Sys 2       28.61       16.09    20.59
English    TRDDC         Sys 1       64.79       67.23    65.99
English    TRDDC         Sys 2       64.92       68.63    66.73
English    ISM           Sys 1       14.89       32.02    20.33
English    ISM           Sys 2       39.33       34.46    36.74
Hindi      TRDDC         -           47.51       68.35    56.06
Hindi      IIT-B         -           83.68       74.14    78.62
Hindi      MNIT          -           1.72        4.82     2.53
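
For readers reproducing these scores, the sketch below shows how the three metrics relate. It assumes the balanced F-measure (the harmonic mean of precision and recall), which agrees with the IIT-B Hindi figures in the table above; the exact matching criterion used by the organizers (for example, exact versus partial entity match) is not stated here, so the counts passed to `precision_recall_f` are purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(num_correct: int, num_predicted: int, num_gold: int):
    """Compute (precision, recall, F) from entity counts.

    num_correct   - predicted entities that match the gold annotation
    num_predicted - all entities output by the system
    num_gold      - all entities in the gold annotation
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Precision and recall taken from the IIT-B Hindi row of the table.
    print(round(f_measure(83.68, 74.14), 2))
    # Hypothetical counts, for illustration only.
    print(precision_recall_f(8, 10, 16))
```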


Organizing Committee

Sobha Lalitha Devi
CLR Group @ AU-KBC Research Centre, Chennai, India.
Contact: sobha@au-kbc.org