Year 1

Foundation & Initial Development

Establishing infrastructure, corpora creation, and baseline systems

📋

Documentation & Planning

  • Project documentation and System Requirements Specification (SRS)
  • Guidelines and standards for corpora creation
  • Evaluation methods development
📚

Data Collection

  • 1 Lakh (100,000) sentences for Tamil (25K per domain)
  • 1 Lakh (100,000) sentences for the Tamil-Hindi parallel corpus
  • 50,000 sentences for the Tamil-Kannada, Tamil-Malayalam, and Tamil-Telugu pairs
  • Creation of a 25K Kannada monolingual corpus
🔧

System Development - Kannada

  • Morphological Analyser (60% accuracy)
  • POS tagging system
  • Chunk and clause boundary detection
  • Transfer grammar
  • Morphological generator
⚙️

Sampark Pipeline

  • Tamil-Kannada system (55% accuracy)
  • Enhancement for Hindi↔Tamil
  • Enhancement for Tamil↔Malayalam
  • Enhancement for Tamil↔Telugu
🤖

NMT Development

  • Model building for Hindi-Tamil (bi-directional)
  • Model building for Tamil-Malayalam (bi-directional)
  • Model building for Tamil-Telugu (bi-directional)
📊

Evaluation Infrastructure

  • Benchmark data for selected domains
  • Domain dictionaries development
  • Evaluation leaderboard development
💬

Discourse Analysis - Phase 1

  • Annotation guidelines and tagset
  • Automatic discourse chunk tagging system
  • 70,000 coherent chunks (Hindi & Tamil)
  • 35,000 coherent chunks (Kannada, Malayalam, Telugu)
🔗

Resolution Systems

  • Anaphora resolution (Pronominal)
  • Co-reference resolution (Pronominal, Noun-Noun, Definite)
  • Connective resolution
  • Conversation analysis (Ellipsis & Gaps)
  • Platform development for discourse handling
Year 2

Enhancement & Integration

Improving accuracy, domain adaptation, and discourse integration

📈

System Enhancement - Kannada

  • Morphological systems accuracy: 70%
  • Corpus annotation completion
  • POS and chunking improvements
🔄

Sampark Pipeline Enhancement

  • Kannada bi-directional: 65% accuracy
  • Domain adaptation for all language pairs
  • Hindi↔Tamil improvements
  • Tamil↔Malayalam improvements
  • Tamil↔Telugu improvements
🧠

NMT Refinement

  • Incorporation of linguistic features
  • Model training and tuning
  • Thorough evaluation
  • Kannada NMT model building

Data Validation

  • Sample corpus validation from DMU
  • Domain dictionary improvements
  • Benchmark data enhancement
🎯

Discourse Integration

  • Incorporation of discourse parameters into Sampark
  • Enhancement of resolution systems
  • Alpha version platform release
📱

Leaderboard Release

  • Alpha version of evaluation leaderboard
  • Public testing and feedback
  • Performance benchmarking
Year 3

Optimization & Deployment

Final refinements, deployment, and service launch

🎓

Peak Performance - Kannada

  • Morphological systems: 75% accuracy
  • Sampark bi-directional: 80% accuracy
  • Complete system optimization
🚀

Domain Adaptation Complete

  • Hindi↔Tamil: 90-95% accuracy
  • Tamil↔Malayalam: 90-95% accuracy
  • Tamil↔Telugu: 90-95% accuracy
  • All domain-specific optimizations
🏆

NMT Excellence

  • Kannada NMT: 25% BLEU score
  • Linguistic feature integration
  • Model training completion
  • Comprehensive evaluation

Pre- & Post-Processing

  • Development of required modules
  • Domain dictionary finalization
  • System integration
🌐

Platform Deployment

  • Final platform for discourse handling
  • System deployment
  • User feedback collection
  • Continuous improvement
📦

Final Deliverables

  • 1 Lakh (100,000) sentences of Tamil-Hindi parallel corpora
  • 50,000 sentences for 3 language pairs
  • 8 Sampark systems (4 language pairs)
  • 8 Hybrid NMT systems (4 language pairs)
  • Evaluation leaderboard release
  • Complete discourse platform

Target Achievements

  • 🎯 Translation accuracy: 90-95% for main language pairs
  • 📚 Parallel sentences: 1 Lakh+ (Tamil-Hindi corpus)
  • 🤖 MT systems: 8 (Sampark + NMT)
  • 🏆 BLEU score: 25% (Kannada NMT target)
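
For readers unfamiliar with the BLEU metric behind the Kannada NMT target, the following is a minimal sketch of how a corpus-level BLEU score can be computed. It is not the project's evaluation method: the function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch assumes whitespace tokenization, a single reference per sentence, no smoothing, and a 0-100 scale (so the 25% target is read here as a score of 25). Real evaluations normally rely on a standard tool such as sacreBLEU rather than hand-written code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Simplifying assumptions: whitespace tokenization and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when the output is shorter
    # than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Under these assumptions, a system output identical to its reference scores 100, an output sharing no words with the reference scores 0, and the target above would correspond to `corpus_bleu(system_outputs, reference_translations) >= 25` on a held-out test set, where both argument names are placeholders for lists of sentences.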