Introduction


Entities are real-world elements or objects such as person names, organization names, product names, and location names. They are often referred to as named entities. Entity extraction is the automatic identification of named entities in a text document: given a document, entities such as person, organization, location, and product names are identified and tagged. Identifying named entities is important for several higher-level language technology systems, such as information extraction, machine translation, and cross-lingual information access systems.

Over the past decade, Indian language content on media such as websites, blogs, email, and chats has increased significantly. With the advent of smartphones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations, and governments. This content growth is driven by people from non-metros and small cities, who are mostly more comfortable in their mother tongue than in English. Indian language content is still only a fraction of English content, but it is expected to grow by more than 70% every year. Hence there is a great need to process this data automatically. Companies in particular are interested in ascertaining public opinion on their products and processes. This requires natural language processing systems that identify entities and the associations or relations between them. Hence an automatic entity extraction system is required.



The objectives of this evaluation exercise are:

  • Create benchmark data for entity extraction in Indian language social media text.
  • Encourage researchers to develop Named Entity Recognition (NER) systems for social media text.
  • Provide an opportunity for researchers to compare different machine learning techniques.



Training Corpus


The training corpus has been released!

To obtain the training data, all registered participants are requested to fill in and sign the copyright form, available at the link below:
Copyright Form

Training data has been released for Hindi, Malayalam, Tamil and English.

Registration


Registration is now open!
Please register by sending an email to sobha@au-kbc.org with the following details:
"Team Leader Name", "Team Affiliation", "Team Contact Person Name", "Email ID", and "Languages for which participating"


Submission Format & Training Annotation Format


Training Corpus Annotation format description

For each language, the training data includes an annotation file with 6 tab-separated columns:

  1. Tweet_ID
  2. User_Id
  3. NE_TAG
  4. NE raw string
  5. NE Start_Index
  6. NE_Length
 For example:

Tweet ID:123456789012345678	User Id:1234567890	NETAG:ORGANIZATION	NE:SonyTV	Index:43	Length:6


  • The Index column is the starting character position of the NE, calculated separately for each tweet.
  • The test data will be provided in the same format as the training data, except that the NE annotation file will not be included.
  • Participants must submit their test runs as tagged output in exactly the same format as the training annotation file.
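To illustrate the six-column layout described above, here is a minimal Python sketch that reads one annotation line. It is not an official tool. The names `Annotation`, `strip_label`, `parse_line`, and `format_line` are hypothetical, and the sketch assumes that fields may or may not carry the "Label:" prefixes shown in the example line, so it accepts both forms.

```python
from typing import NamedTuple

# Labels as they appear in the example line. Whether the released files
# actually carry these prefixes is an assumption, so both forms are accepted.
LABELS = ("Tweet ID", "User Id", "NETAG", "NE", "Index", "Length")


class Annotation(NamedTuple):
    tweet_id: str
    user_id: str
    ne_tag: str
    ne_text: str
    start: int
    length: int


def strip_label(field: str, label: str) -> str:
    """Drop a leading 'Label:' prefix if present; otherwise return the field unchanged."""
    prefix = label + ":"
    return field[len(prefix):] if field.startswith(prefix) else field


def parse_line(line: str) -> Annotation:
    """Parse one tab-separated annotation line into an Annotation."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    tweet_id, user_id, ne_tag, ne_text, start, length = (
        strip_label(f, l).strip() for f, l in zip(fields, LABELS)
    )
    return Annotation(tweet_id, user_id, ne_tag, ne_text, int(start), int(length))


def format_line(ann: Annotation) -> str:
    """Join the six values with tabs. Writing them without label prefixes is an assumption."""
    return "\t".join(str(value) for value in ann)


example = ("Tweet ID:123456789012345678\tUser Id:1234567890\tNETAG:ORGANIZATION"
           "\tNE:SonyTV\tIndex:43\tLength:6")
ann = parse_line(example)
print(ann.ne_tag, ann.ne_text, ann.start, ann.length)  # ORGANIZATION SonyTV 43 6
```

Tweet and user IDs are kept as strings so that long identifiers are written back unchanged. The description does not say whether the Index column counts from 0 or 1, so the sketch leaves the value untouched.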


Evaluation Criteria



    1. We will use Precision, Recall and F-Measure as the evaluation metrics.
    2. The test run submission file must strictly follow the format explained in the earlier section.
      If a run submission file does not follow the submission format, the run will be rejected.
    3. For the evaluation, the fields "Tweet_ID", "User_Id", "NE_TAG", "NE_Start_Index" and "NE_Length" must match the gold standard.
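The criteria above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the organizers' scoring script. It assumes an entity counts as correct only when all five listed fields match exactly, and that F-Measure means the balanced harmonic mean of precision and recall (F1). The function name `evaluate` is hypothetical.

```python
from typing import Iterable, Tuple

# One key per entity: (Tweet_ID, User_Id, NE_TAG, NE_Start_Index, NE_Length)
Key = Tuple[str, str, str, int, int]


def evaluate(gold: Iterable[Key], predicted: Iterable[Key]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) using exact matching on all five fields."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)  # entities present in both sets
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    # Assumption: balanced F-measure (F1); the task description does not name a variant.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = [("1", "10", "PERSON", 0, 5), ("1", "10", "ORGANIZATION", 43, 6)]
pred = [("1", "10", "PERSON", 0, 5), ("1", "10", "LOCATION", 43, 6)]
print(evaluate(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy run the second prediction has the right span but the wrong tag, so it counts as both a false positive and a missed gold entity.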


    Task Coordinators - Organizing Committee


    Computational Linguistics Research Group (CLRG),
    AU-KBC Research Centre



    Pattabhi RK Rao, AU-KBC Research Centre, Chennai, India.
    Malarkodi CS, AU-KBC Research Centre, Chennai, India.
    Vijay Sundar Ram, AU-KBC Research Centre, Chennai, India.
    Sobha Lalitha Devi (Chair), AU-KBC Research Centre, Chennai, India.