📋
Documentation & Planning
- Project documentation and System Required Specification (SRS)
- Guidelines and standards for corpora creation
- Evaluation methods development
📚
Data Collection
- 1 Lakh sentences for Tamil (25K per domain)
- 1 Lakh sentences for Tamil-Hindi parallel corpus
- 50,000 sentences for Tamil-Kannada, Malayalam, Telugu pairs
- 25K Kannada monolingual corpora creation
🔧
System Development - Kannada
- Morphological Analyser (60% accuracy)
- POS tagging system
- Chunk and clause boundary detection
- Transfer grammar
- Morphological generator
⚙️
Sampark Pipeline
- Tamil-Kannada system (55% accuracy)
- Enhancement for Hindi↔Tamil
- Enhancement for Tamil↔Malayalam
- Enhancement for Tamil↔Telugu
🤖
NMT Development
- Model building for Hindi-Tamil (bi-directional)
- Model building for Tamil-Malayalam (bi-directional)
- Model building for Tamil-Telugu (bi-directional)
📊
Evaluation Infrastructure
- Benchmark data for selected domains
- Domain dictionaries development
- Evaluation leaderboard development
💬
Discourse Analysis - Phase 1
- Annotation guidelines and tagset
- Automatic discourse chunk tagging system
- 70,000 coherent chunks (Hindi & Tamil)
- 35,000 coherent chunks (Kannada, Malayalam, Telugu)
🔗
Resolution Systems
- Anaphora resolution (Pronominal)
- Co-reference resolution (Pronominal, Noun-Noun, Definite)
- Connective resolution
- Conversation analysis (Ellipsis & Gaps)
- Platform development for discourse handling
📈
System Enhancement - Kannada
- Morphological systems accuracy: 70%
- Corpus annotation completion
- POS and chunking improvements
🔄
Sampark Pipeline Enhancement
- Kannada bi-directional: 65% accuracy
- Domain adaptation for all language pairs
- Hindi↔Tamil improvements
- Tamil↔Malayalam improvements
- Tamil↔Telugu improvements
🧠
NMT Refinement
- Incorporation of linguistic features
- Model training and tuning
- Thorough evaluation
- Kannada NMT model building
✅
Data Validation
- Sample corpus validation from DMU
- Domain dictionary improvements
- Benchmark data enhancement
🎯
Discourse Integration
- Incorporation of discourse parameters to Sampark
- Enhancement of resolution systems
- Alpha version platform release
📱
Leaderboard Release
- Alpha version of evaluation leaderboard
- Public testing and feedback
- Performance benchmarking
🎓
Peak Performance - Kannada
- Morphological systems: 75% accuracy
- Sampark bi-directional: 80% accuracy
- Complete system optimization
🚀
Domain Adaptation Complete
- Hindi↔Tamil: 90-95% accuracy
- Tamil↔Malayalam: 90-95% accuracy
- Tamil↔Telugu: 90-95% accuracy
- All domain-specific optimizations
🏆
NMT Excellence
- Kannada NMT: 25% BLEU score
- Linguistic feature integration
- Model training completion
- Comprehensive evaluation
⚡
Pre & Post Processing
- Development of required modules
- Domain dictionary finalization
- System integration
🌐
Platform Deployment
- Final platform for discourse handling
- System deployment
- User feedback collection
- Continuous improvement
📦
Final Deliverables
- 1 Lakh sentences Tamil-Hindi parallel corpora
- 50,000 sentences for 3 language pairs
- 8 Sampark systems (4 language pairs)
- 8 Hybrid NMT systems (4 language pairs)
- Evaluation leaderboard release
- Complete discourse platform
🎯
90-95%
Translation Accuracy
For main language pairs
📚
1L+
Parallel Sentences
Tamil-Hindi corpus
🤖
8
MT Systems
Sampark + NMT
🏆
25%
BLEU Score
Kannada NMT target