Computational Linguistic Research Group

AUKBC Tamil Part-of-Speech Corpus (AUKBC-TamilPOSCorpus2016v1)

&

AUKBC Tamil Part-of-Speech Engine (AUKBC-TamilPOSEngine2016v1)

Released on 24th May 2016, at WILD^RE 3, co-located with the10th edition of LREC, Portorož (Slovenia)

This is the largest POS tagged corpus available for Indian languages with 500K tokens annotated using the BIS POS tagset¹.

AUKBC-TamilPOSCorpus2016v1 is developed by the Computational Linguistic Research Group (CLRG), AU-KBC Research Centre, MIT Campus of Anna University, Chrompet, Chennai, India.

The source of the corpus is a very famous contemporary novel in Tamil "Ponniyin Selvan" written by "Kalki Krishnamoorthy”. It is a historical novel which narrates the story of Arulmozhivarman (later crowned as Rajaraja Chola I), one of the kings of the Chola Dynasty during the 10th and 11th centuries. Though this narrates the story of 10th and 11th centuries, the language style is in 20th century Tamil.

Annotation Schema:

The tagset used for annotation is the Bureau of Indian Standards Tagset called BIS Tagset, for Tamil which is also approved by Tamil Virtual Academy as the standard for Tamil. The tagset and annotation guidelines are given as a separate document. (POS Annotation Tagset)

Corpus Statistics:

Total Number of sentences	50,876
Number of tokens	5,15,283
Number of words	4,14,483 (without Punctuation and symbols)

AUKBC-TamilPOSCorpus2016v1 is released under restricted license and not allowed for redistribution. The complete License document is available here. (CorpusLicense.txt).

AUKBC-TamilPOSEngine2016v1 is released under the GNU GPL version 3.0 license. The complete license document is available here (http://www.gnu.org/licenses/gpl-3.0.txt).

The annotated corpus and the engine are freely downloadable. It is understood that the user by downloading the corpus or the POS tagger engine, agrees to the terms and conditions of the licenses.

Corpus Download:

The corpus and the tagger are access protected.
To obtain the access codes and the corpus please fill the following details and mail to sobha@au-kbc.org and copy paste with in quotes content as the subject line “Request for Tamil POS corpus and tagger access code”.

Person Name (Full name with Title/Surname):
Affiliation:
Address (Full address):
Email ID (please provide a valid email ID, so that we can keep you updated):
Mobile Number:
Purpose (Why do you need the corpus):

[1]BIS POS Standardisation

AU-KBC Research Centre

Computational Linguistics Research Group

AUKBC Tamil Part-of-Speech Corpus (AUKBC-TamilPOSCorpus2016v1)

&

AUKBC Tamil Part-of-Speech Engine (AUKBC-TamilPOSEngine2016v1)