ESM-IL Entity Extraction from Social Media Text in Indian Languages

Introduction

Entities are real-world elements or objects such as person names, organization names, product names, and location names. They are often referred to as named entities. Entity extraction is the automatic identification of named entities in a text document: given a document, entities such as person, organization, location, and product names are identified and tagged. Identifying named entities is important for several higher-level language technology systems, such as information extraction, machine translation, and cross-lingual information access systems.

Over the past decade, Indian language content on media such as websites, blogs, email, and chats has increased significantly. With the advent of smartphones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations, and governments. This content growth is driven by people from non-metros and small cities, who are mostly more comfortable in their mother tongue than in English. Indian language content is still only a fraction of the English content, but it is expected to grow by more than 70% every year. Hence there is a great need to process this data automatically. Companies in particular are interested in ascertaining the public view of their products and processes. This requires natural language processing systems that identify entities and the associations or relations between them. Hence an automatic entity extraction system is required.

The objectives of this evaluation exercise are:

Creating benchmark data for Entity Extraction in Indian language Social Media text.
Encouraging researchers to develop Named Entity Recognition (NER) systems for Social Media text.
Providing an opportunity for researchers to compare different machine learning techniques.

Training Corpus

The training corpus has been released!

To obtain the training data, all registered participants are requested to fill in and sign the copyright form, which is available at the link below:
Copyright Form

Training data has been released for Hindi, Malayalam, Tamil and English.

Registration

Registration is now open!
Please register by sending an email to sobha@au-kbc.org with the following details:
"Team Leader Name", "Team Affiliation", "Team Contact Person Name", "Email ID", and "Languages for which participating"

Submission Format & Training Annotation Format

Training Corpus Annotation format description

For each language, the training data includes an annotation file with 6 columns, separated by tabs:

Tweet_ID
User_Id
NE_TAG
NE raw string
NE Start_Index
NE_Length

For example:

Tweet ID:123456789012345678	User Id:1234567890	NETAG:ORGANIZATION	NE:SonyTV	Index:43	Length:6

The Index column is the starting character position of the NE, calculated separately for each tweet.
The test data will be provided in the same format as the training data, except that the NE annotation file will not be included.
Participants must submit their test runs in exactly the same format as the training annotation file.
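
As an illustration of the six-column format described above, the following is a minimal Python sketch that reads one annotation line and writes it back out. It is not an official tool of the task. The names `Annotation`, `parse_line`, and `format_line` are hypothetical, and the sketch assumes that every field carries a "Label:value" prefix exactly as in the example line; if the released files contain bare values, the label handling would need to change. Whether the Index column counts from 0 or from 1 is not stated here, so the sketch stores the number without interpreting it.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One row of the annotation file (hypothetical container)."""
    tweet_id: str
    user_id: str
    ne_tag: str
    ne_text: str
    start: int   # starting character position within the tweet
    length: int  # number of characters in the NE


def _value(field: str) -> str:
    # Assumption: fields look like "Label:value" as in the example line.
    # Split on the first colon only, so colons inside the value survive.
    _label, sep, value = field.partition(":")
    return value if sep else field


def parse_line(line: str) -> Annotation:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    tweet_id, user_id, ne_tag, ne_text, start, length = (_value(f) for f in fields)
    return Annotation(tweet_id, user_id, ne_tag, ne_text, int(start), int(length))


def format_line(a: Annotation) -> str:
    # Labels are copied from the example line; they are an assumption
    # about the real files, not a confirmed specification.
    return "\t".join([
        f"Tweet ID:{a.tweet_id}",
        f"User Id:{a.user_id}",
        f"NETAG:{a.ne_tag}",
        f"NE:{a.ne_text}",
        f"Index:{a.start}",
        f"Length:{a.length}",
    ])


example = ("Tweet ID:123456789012345678\tUser Id:1234567890\t"
           "NETAG:ORGANIZATION\tNE:SonyTV\tIndex:43\tLength:6")
row = parse_line(example)
```

Tweet and user IDs are kept as strings so that long IDs are written back unchanged. A simple sanity check before submitting a run is that `len(row.ne_text)` equals `row.length` for every row.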

Evaluation Criteria

We will be using Precision, Recall and F-Measure metrics for evaluation.

The test run submission file must strictly follow the format explained in the earlier section.
If a run submission file does not follow this format, the run will be rejected.

For the evaluation, the fields "Tweet_ID", "User_Id", "NE_TAG", "NE_Start_Index" and "NE_Length" must match the gold standard.
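
To make the matching criterion concrete, here is a minimal Python sketch of how Precision, Recall and F-Measure could be computed from the five matched fields. This is not the organizers' scoring script. The function name `evaluate` is hypothetical, and the sketch assumes exact matching on all five fields, scores pooled over all tweets, and a balanced F-Measure (F1); the actual evaluation may differ on any of these points.

```python
from typing import Iterable, Tuple

# (Tweet_ID, User_Id, NE_TAG, NE_Start_Index, NE_Length)
Key = Tuple[str, str, str, int, int]


def evaluate(gold: Iterable[Key], predicted: Iterable[Key]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) under exact matching of all five fields."""
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up illustrative data, not taken from the corpus.
gold = [
    ("1", "10", "ORGANIZATION", 43, 6),
    ("1", "10", "PERSON", 0, 5),
    ("2", "11", "LOCATION", 12, 7),
    ("3", "12", "PERSON", 8, 4),
]
predicted = [
    ("1", "10", "ORGANIZATION", 43, 6),  # correct
    ("1", "10", "PERSON", 0, 5),         # correct
    ("2", "11", "PERSON", 12, 7),        # wrong tag, so no match
]
precision, recall, f1 = evaluate(gold, predicted)
```

Under these assumptions a prediction with the right span but the wrong tag counts as both a false positive and a missed gold entity, which is why the third prediction lowers precision and recall together.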

Task Coordinators - Organizing Committee

Computational Linguistics Research Group (CLRG),
AU-KBC Research Centre

(Chair)

Entity Extraction from Social Media Text

Indian Languages (ESM-IL)

held in conjunction with the FIRE 2015 Forum for Information Retrieval Evaluation

4th - 6th December 2015, DAIICT, Gandhinagar

Important Dates