Center for Tamil Natural Language Processing Research

The Center for Tamil Natural Language Processing Research (CTNLPR) aims to research and develop the natural language processing tools required for Tamil and to build an active scholarly network of people contributing to the advancement of the language.

15/06/2026

🚀𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗧𝗮𝗺𝗶𝗹 𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗦𝘆𝘀𝘁𝗲𝗺 𝗳𝗼𝗿 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻

Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not explain how references to those entities evolve throughout a document.

𝗧𝗵𝗶𝘀 𝗶𝘀 𝘄𝗵𝗲𝗿𝗲 𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 (𝗖𝗥) 𝗯𝗲𝗰𝗼𝗺𝗲𝘀 𝗲𝘀𝘀𝗲𝗻𝘁𝗶𝗮𝗹.

Coreference Resolution is the task of determining whether multiple mentions within a document refer to the same real-world entity. It acts as a critical bridge between entity recognition and structured knowledge extraction.

𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀:

• Knowledge Graph Construction
• Relation Extraction
• Semantic Search
• Document Intelligence
• Retrieval-Augmented Generation (RAG)
• Conversational AI

Accurate coreference resolution is often the difference between fragmented information and coherent knowledge.

𝗧𝗵𝗲 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻 𝗣𝗿𝗼𝗯𝗹𝗲𝗺

During our work at CTNLPR, we observed a common challenge in Tamil document processing.

Documents rarely repeat the full entity name in every sentence. Instead, they rely on:

• Pronouns
• Possessive references
• Descriptive noun phrases
• Location references

Humans resolve these references naturally using context. Machines do not.

With coreference resolution, the extracted knowledge becomes meaningful and directly usable within downstream systems.

𝗪𝗵𝘆 𝗪𝗲 𝗗𝗶𝗱 𝗡𝗼𝘁 𝗦𝘁𝗮𝗿𝘁 𝘄𝗶𝘁𝗵 𝗡𝗲𝘂𝗿𝗮𝗹 𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

Most modern coreference systems rely on:

• Transformer-based architectures
• Mention-ranking models
• End-to-end neural systems
• Span-ranking approaches

While highly effective for English and other high-resource languages, they typically require:

• Large annotated datasets
• Extensive model training
• Significant computational resources
• Language-specific supervision

Tamil currently lacks large-scale publicly available coreference corpora.

Instead of waiting for benchmark datasets, we explored a different direction:

𝗕𝘂𝗶𝗹𝗱 𝗮 𝗱𝗲𝘁𝗲𝗿𝗺𝗶𝗻𝗶𝘀𝘁𝗶𝗰, 𝗲𝘅𝗽𝗹𝗮𝗶𝗻𝗮𝗯𝗹𝗲, 𝗮𝗻𝗱 𝘁𝗮𝘀𝗸-𝗼𝗿𝗶𝗲𝗻𝘁𝗲𝗱 𝗰𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱 𝗳𝗼𝗿 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻.

Our goal was not to compete with neural benchmarks.

Our goal was to improve relation extraction quality in real-world Tamil document processing.

𝗦𝘆𝘀𝘁𝗲𝗺 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲

Tamil Document
↓
Text Normalization
↓
Sentence Segmentation
↓
Mention Detection
↓
Entity Normalization
↓
Entity Memory
↓
Coreference Resolution
↓
Coreference Chain Construction
↓
Visualization Layer

Each layer contributes toward discourse-level entity understanding.

𝗠𝗲𝗻𝘁𝗶𝗼𝗻 𝗗𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻

The mention detection layer combines multiple strategies:

• Named Entity Recognition (PERSON, LOCATION, ORGANIZATION)
• Pronoun Detection
• Location Reference Detection
• Rule-Based Noun Phrase Detection

These mentions become candidates for resolution.

𝗘𝗻𝘁𝗶𝘁𝘆 𝗡𝗼𝗿𝗺𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻

Tamil's rich morphology creates multiple surface forms for the same entity.

𝗘𝘅𝗮𝗺𝗽𝗹𝗲:

┌────────────────┐
யாழ்ப்பாணம்

யாழ்ப்பாணத்தில்

யாழ்ப்பாணத்தின்

யாழ்ப்பாணத்திற்கு
└────────────────┘

Using Stanza lemmatization:

┌─────────────────┐
யாழ்ப்பாணத்தில்
↓
யாழ்ப்பாணம்
└─────────────────┘

This reduces entity fragmentation and improves linking consistency.

𝗧𝗵𝗲 𝗘𝗻𝘁𝗶𝘁𝘆 𝗠𝗲𝗺𝗼𝗿𝘆 𝗟𝗮𝘆𝗲𝗿

One of the key design decisions was introducing a lightweight discourse memory.

Instead of neural antecedent scoring, the system maintains contextual entity state:

last_person
last_location
last_org

Whenever a new entity is detected, the corresponding memory state is updated.

This memory acts as the document's discourse context.
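
A minimal sketch of such a memory, assuming one slot per entity type that always holds the most recent canonical entity. The class and method names are hypothetical and are not taken from the CTNLPR implementation.

```python
from dataclasses import dataclass
from typing import Optional

PERSON_NAME = "சாம்.ஏ.சபாபதி"
PLACE_NAME = "யாழ்ப்பாணம்"


@dataclass
class EntityMemory:
    """Most recent canonical entity of each type seen so far in the document."""
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None

    def update(self, entity_type: str, canonical: str) -> None:
        # Overwrite the slot matching the entity type; unknown types are ignored.
        if entity_type == "PERSON":
            self.last_person = canonical
        elif entity_type == "LOCATION":
            self.last_location = canonical
        elif entity_type == "ORGANIZATION":
            self.last_org = canonical


memory = EntityMemory()
memory.update("PERSON", PERSON_NAME)
memory.update("LOCATION", PLACE_NAME)
```

A single slot per type keeps the state easy to inspect, at the cost of losing track of earlier entities when two people or places alternate in the text.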

𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻

Once discourse memory has been established, the resolver links newly encountered mentions to previously observed canonical entities.

The system performs:

• Person Pronoun Resolution
• Possessive Resolution
• Location Resolution
• Rule-Based Noun Phrase Resolution

By maintaining discourse state across sentence boundaries, fragmented references are transformed into consistent entity representations.

This significantly improves downstream relation extraction quality.
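
The resolution step could look roughly like the sketch below. It assumes mentions arrive in document order as (surface, entity type) pairs, with the type set to None for pronouns and other references. The lookup tables are illustrative and far from complete, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

HE = "அவர்"
THIS_PERSON = "இவர்"
HIS = "அவரது"
THERE = "அங்கு"

# Hypothetical lookup tables; a real rule set would cover many more forms.
PERSON_REFERENCES = {HE, THIS_PERSON, HIS}
LOCATION_REFERENCES = {THERE}

Mention = Tuple[str, Optional[str]]


def resolve_mentions(mentions: List[Mention]) -> List[Tuple[str, Optional[str]]]:
    """Link each mention to the most recent entity of the matching type."""
    memory: Dict[str, Optional[str]] = {
        "PERSON": None, "LOCATION": None, "ORGANIZATION": None,
    }
    links = []
    for surface, entity_type in mentions:
        if entity_type in memory:
            # A named entity updates the memory and resolves to itself.
            memory[entity_type] = surface
            links.append((surface, surface))
        elif surface in PERSON_REFERENCES:
            links.append((surface, memory["PERSON"]))
        elif surface in LOCATION_REFERENCES:
            links.append((surface, memory["LOCATION"]))
        else:
            links.append((surface, None))  # unresolved mention
    return links


PERSON_NAME = "சாம்.ஏ.சபாபதி"
PLACE_NAME = "யாழ்ப்பாணம்"
links = resolve_mentions([
    (PERSON_NAME, "PERSON"),
    (HE, None),
    (PLACE_NAME, "LOCATION"),
    (THERE, None),
])
```

A reference seen before any entity of its type stays unresolved (None) rather than being linked to a guess.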

𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗖𝗵𝗮𝗶𝗻 𝗖𝗼𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻

Rather than replacing mentions individually, the system groups related mentions into entity clusters.

Consider this example:

சாம்.ஏ.சபாபதி ஒரு பிரபல சமூக சேவகர். அவர் யாழ்ப்பாணத்தில் பல கல்வித் திட்டங்களை முன்னெடுத்தார். இந்த சமூகசேவகர் பல விருதுகளை பெற்றுள்ளார். அவர் யாழ்ப்பாணத்தில் பிறந்தார். அங்கு அவருக்கு பெரும் மதிப்பு இருந்தது. இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் முக்கிய பங்காற்றினார்.

┌─────────────────────┐
Entity: சாம்.ஏ.சபாபதி

├── சாம்.ஏ.சபாபதி
├── அவர்
├── இந்த சமூகசேவகர்
├── அவர்
└── இவர்

Entity: யாழ்ப்பாணம்

├── யாழ்ப்பாணம்
└── அங்கு
└─────────────────────┘

These chains provide a document-level view of entity references.

Useful for:

• Debugging
• Evaluation
• Knowledge Graph Construction
• Relation Extraction
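
Chain construction of the kind described above can be sketched as grouping resolved links by their canonical entity. The input format, a list of (mention, entity) pairs in document order, is an assumption made for this example.

```python
from typing import Dict, List, Optional, Tuple


def build_chains(links: List[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Group mentions under the canonical entity they were resolved to."""
    chains: Dict[str, List[str]] = {}
    for mention, entity in links:
        if entity is None:
            continue  # unresolved mentions do not join any chain
        chains.setdefault(entity, []).append(mention)
    return chains


PERSON_NAME = "சாம்.ஏ.சபாபதி"
PLACE_NAME = "யாழ்ப்பாணம்"
chains = build_chains([
    (PERSON_NAME, PERSON_NAME),
    ("அவர்", PERSON_NAME),
    (PLACE_NAME, PLACE_NAME),
    ("அங்கு", PLACE_NAME),
    ("இவர்", PERSON_NAME),
])

for entity, mentions in chains.items():
    print("Entity:", entity)
    for mention in mentions:
        print("  -", mention)
```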

𝗩𝗶𝘀𝘂𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 & 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀

To support experimentation and validation, we developed a Streamlit-based visualization layer.

Users can:

• Submit Tamil documents
• Inspect generated coreference chains
• Analyze entity clusters
• Validate resolution decisions

This provides transparency into the resolution process and helps identify weaknesses in rule design.

𝗞𝗲𝘆 𝗜𝗻𝘀𝗶𝗴𝗵𝘁

𝗖𝗼𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗥𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗶𝘀 𝗻𝗼𝘁 𝗺𝗲𝗿𝗲𝗹𝘆 𝗮 𝗽𝗿𝗼𝗻𝗼𝘂𝗻-𝗿𝗲𝘀𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝘁𝗮𝘀𝗸.

It is an entity consistency layer that connects:

• Named Entity Recognition
• Relation Extraction
• Knowledge Graph Construction
• Semantic Search
• RAG Systems

Without coreference:

┌─────────────────┐
(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)
└─────────────────┘

With coreference:

┌───────────────────────┐
(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)
└───────────────────────┘

The second representation is immediately usable within structured knowledge systems.
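
As a rough illustration, the rewrite from the first triple to the second amounts to substituting resolved mentions into the subject and object slots. The mapping is assumed to hold only for the sentence the triple came from, since the same pronoun can point to different entities elsewhere. The function name is hypothetical.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def rewrite_triple(triple: Triple, resolved: Dict[str, str]) -> Triple:
    """Replace subject and object mentions with their canonical entities."""
    subject, predicate, obj = triple
    return (resolved.get(subject, subject), predicate, resolved.get(obj, obj))


PRONOUN = "இவர்"
PERSON_NAME = "சாம்.ஏ.சபாபதி"
raw = (PRONOUN, "பங்காற்றினார்", "யாழ்ப்பாண நூலகம்")
clean = rewrite_triple(raw, {PRONOUN: PERSON_NAME})
```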

𝗦𝘆𝘀𝘁𝗲𝗺 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗮𝘁 𝗖𝗧𝗡𝗟𝗣𝗥

Current capabilities include:

✅ Named Entity Recognition
✅ Entity Normalization
✅ Pronoun Resolution
✅ Possessive Resolution
✅ Location Resolution
✅ Rule-Based Noun Phrase Resolution
✅ Coreference Chain Construction
✅ Streamlit-Based Visualization

The system acts as a foundational layer between entity extraction and knowledge graph generation.

𝗙𝘂𝘁𝘂𝗿𝗲 𝗪𝗼𝗿𝗸

• Multi-entity discourse memory
• Entity salience tracking
• Advanced noun phrase resolution
• Relation-aware coreference resolution
• Knowledge graph integration
• Hybrid neural-rule architectures

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻

Building effective Tamil Information Extraction systems requires more than Named Entity Recognition.

By introducing a dedicated coreference resolution layer, we can maintain entity consistency across documents, improve relation extraction quality, and generate more reliable structured knowledge.

For low-resource languages such as Tamil, carefully designed rule-based systems remain a practical and effective pathway toward document-level semantic understanding while larger neural approaches continue to mature.

03/06/2026

⚡️𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗜𝗻𝗱𝗶𝗰𝗡𝗘𝗥 𝗳𝗼𝗿 𝗦𝗿𝗶 𝗟𝗮𝗻𝗸𝗮𝗻 𝗧𝗮𝗺𝗶𝗹 𝗡𝗮𝗺𝗲𝗱 𝗘𝗻𝘁𝗶𝘁𝘆 𝗥𝗲𝗰𝗼𝗴𝗻𝗶𝘁𝗶𝗼𝗻

Transformer-based multilingual NLP systems have significantly improved Named Entity Recognition (NER) across many languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation.

At CTNLPR, we fine-tuned 𝗮𝗶𝟰𝗯𝗵𝗮𝗿𝗮𝘁/𝗜𝗻𝗱𝗶𝗰𝗡𝗘𝗥 specifically for Sri Lankan Tamil using a custom annotated NER corpus.

𝗢𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲

Improve entity recognition for:

• Sri Lankan Tamil linguistic patterns
• Local person, location, and organization names
• Morphology-aware contextual variations

𝗪𝗵𝘆 𝗦𝗿𝗶 𝗟𝗮𝗻𝗸𝗮𝗻 𝗧𝗮𝗺𝗶𝗹 𝗡𝗘𝗥 𝗶𝘀 𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗶𝗻𝗴

Most multilingual NER systems are trained primarily on:

• General web corpora
• Indian Tamil datasets
• Multilingual benchmark datasets
• Formal textual sources

When applied to Sri Lankan Tamil, they often struggle with:

• Regional naming conventions
• Local organization terminology
• Morphological suffix complexity
• OCR-induced token inconsistencies
• Subword tokenization fragmentation
• Ambiguous entity boundaries

These limitations directly affect downstream systems such as:

• Semantic Search
• Document Intelligence
• Knowledge Graph Construction
• Tamil Chatbots
• RAG Systems
• Government Document Processing

𝗠𝗼𝗱𝗲𝗹 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗢𝘃𝗲𝗿𝘃𝗶𝗲𝘄

𝗕𝗮𝘀𝗲 𝗠𝗼𝗱𝗲𝗹

→ ai4bharat/IndicNER

𝗘𝗻𝘁𝗶𝘁𝘆 𝗧𝘆𝗽𝗲𝘀

• PERSON
• LOCATION
• ORGANIZATION

𝗞𝗲𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻𝘀

✅ Tamil-safe Tokenization
✅ Unicode Normalization
✅ BIO Tagging
✅ Proper Subword Label Alignment
✅ Morphology-aware Training
✅ OCR-aware Preprocessing

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀

1️⃣ 𝗧𝗮𝗺𝗶𝗹 𝗧𝗼𝗸𝗲𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻

Tamil is morphologically rich. Incorrect tokenization can cause:

• Broken entity spans
• Incorrect BIO labels
• Fragmented predictions
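
One way to see why naive splitting breaks Tamil text: a written syllable often spans several Unicode code points (a base letter plus a vowel sign or pulli). The sketch below keeps combining marks attached to their base letter. It is a simplification of full Unicode grapheme segmentation and is not necessarily what "Tamil-safe tokenization" means in the CTNLPR pipeline.

```python
import unicodedata
from typing import List


def grapheme_clusters(text: str) -> List[str]:
    """Split text into base characters with their combining marks attached."""
    clusters: List[str] = []
    for ch in text:
        # Mn/Mc cover Tamil vowel signs and the pulli; attach them to the base.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "யாழ்ப்பாணம்"
print(len(word), "code points,", len(grapheme_clusters(word)), "clusters")
```

Any span boundary that falls inside a cluster separates a vowel sign or pulli from its letter, which is one source of the broken entity spans listed above.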

2️⃣ 𝗦𝘂𝗯𝘄𝗼𝗿𝗱 𝗟𝗮𝗯𝗲𝗹 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁

Transformer tokenizers frequently split Tamil words into multiple subword units.

Without proper alignment:

• Entity spans become corrupted
• BIO labels mismatch
• Training instability increases
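
A common alignment scheme, shown here as a sketch rather than as the exact CTNLPR procedure, labels only the first subword of each word and masks the rest. It assumes a `word_ids` list like the one Hugging Face fast tokenizers return, where each subword maps to its source word index and special tokens map to None.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Expand word-level label ids to subword level."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)            # continuation subword is masked
        previous = word_id
    return aligned


# Two words (label ids 1 and 0); the first word was split into three subwords.
aligned = align_labels([1, 0], [None, 0, 0, 0, 1, None])
```

Masking continuation subwords means each word contributes exactly one label to the loss, so a long Tamil word split into many pieces does not outweigh a short one.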

3️⃣ 𝗢𝗖𝗥 𝗡𝗼𝗶𝘀𝗲

Tamil OCR systems still generate:

• Grapheme inconsistencies
• Merged tokens
• Invalid Unicode combinations
• Punctuation corruption

Therefore OCR-aware normalization was integrated before training.

𝗠𝗼𝗱𝗲𝗹 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

𝗢𝘃𝗲𝗿𝗮𝗹𝗹 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲

• F1 Score → 0.650
• Precision → 0.602
• Recall → 0.707
• Accuracy → 96.04%

𝗘𝗻𝘁𝗶𝘁𝘆-𝘄𝗶𝘀𝗲 𝗙𝟭

• PERSON → 0.721
• LOCATION → 0.698
• ORGANIZATION → 0.484

PERSON and LOCATION categories achieved relatively strong performance, while ORGANIZATION entities remain the most challenging category.
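
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The snippet recomputes it from the reported values and compares it with the unweighted mean of the entity-wise scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_score(0.602, 0.707)
macro_f1 = (0.721 + 0.698 + 0.484) / 3

print(f"F1 from P and R: {overall_f1:.3f}")
print(f"Unweighted mean of entity-wise F1: {macro_f1:.3f}")
```

The recomputed value matches the reported 0.650. The unweighted entity-wise mean comes out lower, at about 0.634, so the overall score is presumably micro-averaged or weighted by entity frequency. The post does not say which.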

𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗘𝘅𝗮𝗺𝗽𝗹𝗲𝘀

𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝟭

Sentence:
"பாரதிதாசன் எழுதிய நூலை பாரதி பதிப்பகம் வெளியிட்டது."

Output:
👤 PERSON → பாரதிதாசன்
🏢 ORGANIZATION → பாரதி பதிப்பகம்

𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝟮

Sentence:
"வடமராட்சி தொழில்நுட்ப நிறுவனம் மாணவர்களை சேர்த்தது."

Output:
🏢 ORGANIZATION → வடமராட்சி தொழில்நுட்ப நிறுவனம்

𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝟯

Sentence:
"நவமணி கிராமம் வெள்ளத்தால் பாதிக்கப்பட்டது."

Output:
📍 LOCATION → நவமணி

𝗘𝘅𝗮𝗺𝗽𝗹𝗲 𝟰

Sentence:
"ஜே.ஏ.எஸ்.பீ. ஜயசிங்க நவமணி கிராமத்திற்கு சென்றார்."

Output:
👤 PERSON → ஜே.ஏ.எஸ்.பீ. ஜயசிங்க
📍 LOCATION → நவமணி

𝗞𝗲𝘆 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝘁𝗶𝗼𝗻

One of the most important findings from this work is:

"Better preprocessing and domain-specific data can be as important as model architecture."

For low-resource languages like Sri Lankan Tamil:

• High-quality annotations matter
• OCR normalization matters
• Tokenizer alignment matters
• Linguistic preprocessing matters

Large transformer architectures alone are not sufficient without carefully prepared language-specific datasets.

𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀

• Tamil NER Systems
• Semantic Search
• RAG Pipelines
• OCR Information Extraction
• Knowledge Graph Construction
• Tamil Chatbots

This work is part of ongoing Tamil NLP research at CTNLPR aimed at building stronger NLP infrastructure for low-resource Tamil language technologies.

25/05/2026

🌟𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐚 𝐒𝐫𝐢 𝐋𝐚𝐧𝐤𝐚𝐧 𝐓𝐚𝐦𝐢𝐥 𝐍𝐚𝐦𝐞𝐝 𝐄𝐧𝐭𝐢𝐭𝐲 𝐑𝐞𝐜𝐨𝐠𝐧𝐢𝐭𝐢𝐨𝐧 𝐃𝐚𝐭𝐚𝐬𝐞𝐭 𝐟𝐨𝐫 𝐋𝐨𝐰-𝐑𝐞𝐬𝐨𝐮𝐫𝐜𝐞 𝐍𝐋𝐏

The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies across major global languages. However, low-resource languages such as Sri Lankan Tamil still face a severe lack of high-quality annotated datasets—especially for foundational tasks like Named Entity Recognition (NER).

To address this gap, we developed the Srilankan-Tamil-NER Dataset, a Tamil NER dataset designed specifically for Sri Lankan Tamil linguistic and contextual usage.

This dataset is intended to support:

• Tamil NER research
• Indic language fine-tuning
• Information extraction systems
• Retrieval-Augmented Generation (RAG)
• Tamil LLM adaptation
• Domain-specific AI systems for Sri Lanka

𝗪𝗵𝘆 𝗦𝗿𝗶 𝗟𝗮𝗻𝗸𝗮𝗻 𝗧𝗮𝗺𝗶𝗹 𝗡𝗘𝗥 𝗠𝗮𝘁𝘁𝗲𝗿𝘀

Named Entity Recognition (NER) is a core NLP task that identifies and classifies entities such as:

• Person names
• Locations
• Organizations
• Dates
• Miscellaneous entities

NER acts as a foundational layer for many downstream NLP systems including:

• Question answering
• Search systems
• Chatbots
• Document intelligence
• Machine translation
• Knowledge graph generation

For Tamil — particularly Sri Lankan Tamil — publicly available annotated corpora remain extremely limited. Existing multilingual datasets often underrepresent regional linguistic variations, local named entities, and culturally contextual terminology.

Most existing NER systems for Tamil are trained on datasets originating from Indian Tamil corpora, leaving significant gaps in handling:

• Sri Lankan Tamil vocabulary
• Local organization names
• Sri Lankan place names
• Government and institutional terminology

Our dataset aims to bridge this gap.

𝗔𝗯𝗼𝘂𝘁 𝘁𝗵𝗲 𝗗𝗮𝘁𝗮𝘀𝗲𝘁

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗡𝗮𝗺𝗲:
Srilankan-Tamil-NER Dataset

The primary goal of this dataset is to create a high-quality manually curated Named Entity Recognition corpus for Sri Lankan Tamil under CTNLPR.

The dataset is structured to support fine-tuning transformer-based multilingual models such as:

• IndicNER
• mBERT
• XLM-RoBERTa
• MuRIL
• IndicBERT

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝘀

• B-PER (Person): 4,533
• B-LOC (Location): 8,110
• B-ORG (Organization): 3,369
• Total Entities: 16,012

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗣𝗿𝗲𝗽𝗮𝗿𝗮𝘁𝗶𝗼𝗻 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲

Creating a Tamil NER dataset involves significantly more than simple annotation.

The preparation workflow included multiple stages:

1. 𝑫𝒂𝒕𝒂 𝑪𝒐𝒍𝒍𝒆𝒄𝒕𝒊𝒐𝒏

The raw Tamil text corpus was collected from the Noolaham corpus and other relevant publicly available Sri Lankan Tamil textual sources.

Special attention was given to:

• Local linguistic relevance
• Entity diversity
• Sentence quality
• Contextual richness

The objective was to capture realistic Sri Lankan Tamil usage patterns rather than synthetic or translated text.

2. 𝑶𝑪𝑹 𝒂𝒏𝒅 𝑻𝒆𝒙𝒕 𝑵𝒐𝒓𝒎𝒂𝒍𝒊𝒛𝒂𝒕𝒊𝒐𝒏

Tamil NLP pipelines often begin with scanned or image-based documents.

As part of our broader Tamil document intelligence workflow, OCR-extracted Tamil text underwent:

• Unicode normalization
• Punctuation cleaning
• Whitespace normalization
• Invalid character filtering
• OCR noise reduction

OCR-related preprocessing becomes extremely important because Tamil script errors can propagate heavily into token classification systems.
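
The cleaning steps listed above could be sketched with the standard library as follows. The specific rules are assumptions chosen for illustration: which characters count as invalid, and how punctuation is cleaned, are not specified in the post.

```python
import re
import unicodedata


def normalize_ocr_text(text: str) -> str:
    """Apply Unicode, character, punctuation and whitespace cleanup."""
    # NFC composes split sequences, e.g. two-part Tamil vowel signs.
    text = unicodedata.normalize("NFC", text)
    # Drop control (Cc) and format (Cf) characters, keeping newlines and tabs.
    # Note: Cf also removes zero-width joiners, which may not suit every script.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)  # collapse repeated punctuation
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    return text.strip()


cleaned = normalize_ocr_text("  யாழ்ப்பாணம்\u200b   நூலகம்..  ")
```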

3. 𝑵𝒂𝒎𝒆𝒅 𝑬𝒏𝒕𝒊𝒕𝒚 𝑨𝒏𝒏𝒐𝒕𝒂𝒕𝒊𝒐𝒏

The dataset was manually annotated using BIO tagging format.

Entity Types:

• B-PER — Beginning of person entity
• I-PER — Inside person entity
• B-LOC — Beginning of location entity
• I-LOC — Inside location entity
• B-ORG — Beginning of organization entity
• I-ORG — Inside organization entity
• O — Non-entity token

Example:

இராமநாதன் → B-PER
யாழ்ப்பாணம் → B-LOC
பல்கலைக்கழகம் → B-ORG
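
To show how BIO tags turn back into entities, here is a small decoder sketch. The tagged sentence is a hypothetical example, not a record from the dataset, and stray I- tags without a matching B- tag are simply treated as non-entities.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], ""
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["வடமராட்சி", "தொழில்நுட்ப", "நிறுவனம்", "மாணவர்களை", "சேர்த்தது"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O"]
spans = bio_to_spans(tokens, tags)
```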

𝗖𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀 𝗶𝗻 𝗦𝗿𝗶 𝗟𝗮𝗻𝗸𝗮𝗻 𝗧𝗮𝗺𝗶𝗹 𝗡𝗘𝗥

Building a Tamil NER dataset introduced several language-specific challenges.

• Morphological complexity
• OCR noise
• Unicode inconsistencies
• Token boundary detection
• Subword alignment
• Limited benchmark corpora

𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲𝘀

This dataset can support:

• Tamil NER
• OCR post-processing
• Semantic search systems
• RAG pipelines
• Tamil chatbots
• Government document AI
• Knowledge graph generation

The Srilankan-Tamil-NER Dataset, developed under CTNLPR, represents an important step toward strengthening the Sri Lankan Tamil NLP ecosystem through high-quality entity annotation and linguistically relevant corpus preparation.

#𝑆𝑟𝑖𝐿𝑎𝑛𝑘𝑎𝑛𝑇𝑎𝑚𝑖𝑙 #𝑇𝑎𝑚𝑖𝑙𝑁𝐸𝑅 #𝑇𝑎𝑚𝑖𝑙𝑁𝐿𝑃 #𝑁𝑎𝑚𝑒𝑑𝐸𝑛𝑡𝑖𝑡𝑦𝑅𝑒𝑐𝑜𝑔𝑛𝑖𝑡𝑖𝑜𝑛 #𝐿𝑜𝑤𝑅𝑒𝑠𝑜𝑢𝑟𝑐𝑒𝑁𝐿𝑃 #𝐼𝑛𝑑𝑖𝑐𝑁𝐿𝑃 #𝑀𝑢𝑅𝐼𝐿 #𝑚𝐵𝐸𝑅𝑇 #𝑋𝐿𝑀𝑅 #𝐵𝐼𝑂𝑡𝑎𝑔𝑔𝑖𝑛𝑔 #𝐿𝐿𝑀 #𝑅𝐴𝐺 #𝑆𝑒𝑚𝑎𝑛𝑡𝑖𝑐𝑆𝑒𝑎𝑟𝑐ℎ #𝐾𝑛𝑜𝑤𝑙𝑒𝑑𝑔𝑒𝐺𝑟𝑎𝑝ℎ𝑠 #𝐴𝐼𝐸𝑛𝑔𝑖𝑛𝑒𝑒𝑟𝑖𝑛𝑔 #𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝐼𝑛𝑡𝑒𝑙𝑙𝑖𝑔𝑒𝑛𝑐𝑒 #𝐸𝑛𝑡𝑖𝑡𝑦𝐸𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 #𝑂𝐶𝑅 #𝑇𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑟𝑀𝑜𝑑𝑒𝑙𝑠 #𝐶𝑜𝑚𝑝𝑢𝑡𝑎𝑡𝑖𝑜𝑛𝑎𝑙𝐿𝑖𝑛𝑔𝑢𝑖𝑠𝑡𝑖𝑐𝑠 #𝐶𝑇𝑁𝐿𝑃𝑅

15/05/2026

⚡️ 𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐚 𝐓𝐚𝐦𝐢𝐥 𝐂𝐨𝐫𝐞𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐑𝐞𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧 𝐒𝐲𝐬𝐭𝐞𝐦: 𝐙𝐞𝐫𝐨-𝐒𝐡𝐨𝐭 𝐂𝐨𝐧𝐭𝐞𝐱𝐭𝐮𝐚𝐥 𝐌𝐨𝐝𝐞𝐥𝐢𝐧𝐠 𝐰𝐢𝐭𝐡 𝐌𝐮𝐑𝐈𝐋

Coreference Resolution (CR) is a critical NLP task for identifying whether multiple mentions in a document refer to the same real-world entity. It plays a major role in:

• Knowledge Graph Construction
• Relation Extraction
• Semantic Search
• RAG Systems
• Conversational AI
• Document-Level Understanding

While English NLP already has mature coreference systems and libraries, Tamil remains a highly challenging low-resource language for discourse-level semantic modeling.

At CTNLPR, we explored how modern multilingual coreference architectures can be adapted for Tamil using zero-shot contextual semantic modeling instead of heavily supervised pipelines.

Tamil introduces several difficult linguistic challenges:

• Agglutinative morphology
• Free word order
• Pronoun dropping
• Implicit subject references
• Rich inflectional structures
• Long-distance discourse dependencies
• Noun-to-noun semantic references

Additionally:

• No dedicated Tamil coreference libraries currently exist publicly
• Large annotated Tamil CR datasets are unavailable
• Most multilingual systems remain heavily English-biased

⚙️ Our Architecture

The architecture currently being explored at CTNLPR uses:

• MuRIL-based contextual embeddings
• Span-based mention detection
• Contextual span representations
• Cosine similarity-based semantic linking
• Agglomerative clustering

Instead of manually defining antecedents, the system automatically generates semantic mention spans from Tamil text and groups semantically related mentions into discourse-level entity chains using contextual similarity.

🧠 Key Technical Direction

Traditional supervised coreference systems depend heavily on:

• Large annotated corpora
• Expensive training pipelines
• Language-specific supervision
• Antecedent ranking architectures
• High computational cost

For Tamil, these resources are extremely limited.

Our approach avoids heavy annotation dependency while still leveraging multilingual transformer-based semantic understanding learned from Indian-language pretraining.

Each candidate span is encoded using contextual embeddings generated from MuRIL, and span-level semantic representations are constructed using contextual token pooling. The system then performs:

• Heuristic span pruning
• Semantic similarity computation
• Similarity-driven clustering

to generate discourse-level coreference chains.
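
The pooling, similarity and clustering steps can be illustrated in plain Python. The vectors below are made-up stand-ins for MuRIL span embeddings, the threshold is arbitrary, and average linkage is an assumption, since the post does not name the linkage criterion.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average token vectors into one span representation."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def agglomerative_clusters(vectors: List[Vector], threshold: float) -> List[List[int]]:
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        # Average pairwise cosine similarity between two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if score < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


span_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chains = agglomerative_clusters(span_vectors, threshold=0.8)
```

In practice a library implementation such as scikit-learn's agglomerative clustering would replace the quadratic loop; the sketch only makes the merge rule explicit.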

🚀 Key Advantages

• Zero-shot inference
• Low-resource scalability
• Context-aware semantic reasoning
• Better adaptation to Tamil morphology
• Lightweight unsupervised inference
• Reduced annotation dependency

This architecture is being explored at CTNLPR as a foundation for:

• Tamil discourse understanding
• Entity-aware semantic linking
• Knowledge Graph Construction
• Ontology-aware NLP
• Multilingual semantic reasoning
• Advanced RAG systems

🔬 Building document-level semantic understanding for Tamil is one of the next major steps toward scalable low-resource AI systems.

04/05/2026

⚡𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐚 𝐌𝐨𝐫𝐩𝐡𝐨𝐥𝐨𝐠𝐲-𝐀𝐰𝐚𝐫𝐞 𝐓𝐚𝐦𝐢𝐥 𝐍𝐄𝐑 𝐏𝐢𝐩𝐞𝐥𝐢𝐧𝐞: 𝐅𝐫𝐨𝐦 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐞𝐫 𝐄𝐱𝐭𝐫𝐚𝐜𝐭𝐢𝐨𝐧 𝐭𝐨 𝐂𝐚𝐧𝐨𝐧𝐢𝐜𝐚𝐥 𝐄𝐧𝐭𝐢𝐭𝐲 𝐑𝐞𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧

Named Entity Recognition (NER) is a critical layer in our Tamil NLP stack (search, indexing, knowledge graph construction, and RAG).
However, for Tamil, extracting entities is only half the problem — canonicalizing them is the real challenge.

💥 What We Built in CTNLPR

We designed a Tamil-aware NER pipeline by extending a transformer-based model with morphological normalization:

• ai4bharat/IndicNER → baseline entity extraction
• Custom span merging → IOB consolidation
• Prefix-based grouping → variant clustering
• Morphological normalization layer → canonical entity resolution

Since Indic NER models are not morphology-aware, we explicitly evaluated and integrated normalization strategies.

🔰Method Exploration (What We Tried)

• IndicNER (Transformer baseline)
✅ Strong recall across entity types
❎ Produces multiple inflected variants of the same entity

• Prefix-based grouping
✅ Fast heuristic clustering
❎ Not linguistically grounded

• UoM Thamizhi Morphological Normalizer (University of Moratuwa)
✅ Linguistically motivated rule-based approach
❎ Limited effectiveness on real-world data
❎ Struggled with:

* Noisy OCR text
* Complex suffix chains
* Unseen word forms

• Tamil Lemmatizer (final approach)
✅ Consistent root-form extraction
✅ Robust across inflected variants
✅ Best empirical performance in our pipeline

🔬 Key Design Decision

Transformer models do not enforce canonical forms.

👉 Surface forms like:
• இலங்கையில்
• இலங்கையிலும்
• இலங்கையிலே

are extracted as separate entities

👉 After normalization:
• இலங்கை

This enables many-to-one mapping, critical for system consistency.

🧩 Our Setup

Pipeline:

• Document → chunking
• Transformer inference (IndicNER)
• IOB span merging + filtering
• Variant aggregation (prefix-based)
• Morphological normalization (UoM explored → Lemmatizer selected)
• Entity re-indexing

Example:

இலங்கையில் → இலங்கை
இலங்கையிலும் → இலங்கை
இந்தியாவில் → இந்தியா
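
The contrast between prefix-based grouping and lemma-based re-indexing can be sketched as follows. The lemma table is a hard-coded stand-in for a real Tamil lemmatizer, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

LANKA = "இலங்கை"
INDIA = "இந்தியா"

# Stand-in for a lemmatizer: surface form -> canonical form.
LEMMA_TABLE = {
    "இலங்கையில்": LANKA,
    "இலங்கையிலும்": LANKA,
    "இந்தியாவில்": INDIA,
}


def lemmatize(surface: str) -> str:
    return LEMMA_TABLE.get(surface, surface)


def reindex(entities: List[str]) -> Dict[str, List[str]]:
    """Many-to-one mapping: canonical form -> observed surface variants."""
    index: Dict[str, List[str]] = defaultdict(list)
    for surface in entities:
        index[lemmatize(surface)].append(surface)
    return dict(index)


def prefix_groups(entities: List[str], prefix_len: int) -> Dict[str, List[str]]:
    """Heuristic grouping by the first prefix_len code points."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for surface in entities:
        groups[surface[:prefix_len]].append(surface)
    return dict(groups)


extracted = list(LEMMA_TABLE)
canonical_index = reindex(extracted)
coarse = prefix_groups(extracted, prefix_len=1)
```

With a one-character prefix all three forms collapse into a single group, which shows how the heuristic can merge unrelated entities; the lemma-based index keeps the two countries apart.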

➡️ System-Level Challenges We Solved

• Agglutinative suffix handling
• Variant explosion in entity outputs
• OCR/noisy input robustness
• Canonical entity consistency across documents

📊 What We Observed

• Transformer NER → high recall, low canonical consistency
• UoM morphological normalizer → linguistically sound but limited robustness
• Lemmatizer → best normalization performance in practice

🌟 Final system:
IndicNER + Lemmatization (hybrid architecture)

✳️ Key Insight

In Tamil NER, the challenge is not detection —
it is morphological normalization.

NER output ≠ final entity

🌟 Canonicalization is essential for:
• Indexing
• Entity linking
• Knowledge graphs
• RAG systems

🚀 Outcome

We built a production-ready Tamil NER system that:

• Resolves inflected entity variants
• Produces stable canonical forms
• Improves downstream retrieval and analytics
• Scales across multi-document pipelines

🔬 This work is part of ongoing Tamil NLP system development at CTNLPR

24/04/2026

🚀 𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐚 𝐊𝐞𝐲𝐰𝐨𝐫𝐝 𝐄𝐱𝐭𝐫𝐚𝐜𝐭𝐢𝐨𝐧 𝐏𝐢𝐩𝐞𝐥𝐢𝐧𝐞 𝐟𝐨𝐫 𝐓𝐚𝐦𝐢𝐥: 𝐅𝐫𝐨𝐦 𝐒𝐭𝐚𝐭𝐢𝐬𝐭𝐢𝐜𝐚𝐥 𝐌𝐞𝐭𝐡𝐨𝐝𝐬 𝐭𝐨 𝐄𝐦𝐛𝐞𝐝𝐝𝐢𝐧𝐠-𝐁𝐚𝐬𝐞𝐝 𝐌𝐨𝐝𝐞𝐥𝐬

Keyword extraction is a core component in our NLP pipeline (search, indexing, and RAG). In practice, adapting existing methods for Tamil required careful system-level design, not just model selection.

⚙️ What We Built

We implemented a Tamil-aware keyword extraction pipeline by adapting standard NLP libraries:

• scikit-learn → TF-IDF (statistical baseline)
• Gensim → TextRank (graph-based ranking)
• KeyBERT → embedding-based semantic extraction

Since these tools are not natively designed for Tamil, we integrated preprocessing using Indic NLP techniques (tokenization, normalization).

🧠 Method Evaluation (What Worked / What Didn’t)

• TF-IDF
✅ Useful for corpus-level keyword distribution
❎ No semantic understanding

• TextRank
✅ Works without training
❎ Highly sensitive to tokenization quality

• YAKE
✅ Fast, strong baseline for per-document keywords

• KeyBERT (final approach)
✅ Captures semantic relevance
✅ Best performance for Tamil when paired with proper embeddings

🔬 Key Design Decision

KeyBERT itself is not language-aware: it depends entirely on the embedding model.

👉 Using default English embeddings → poor Tamil results

👉 Using Tamil/Indic embeddings → strong semantic extraction

🧩 Our Setup

We integrated KeyBERT with Tamil-capable embedding models:

• l3cube-pune/tamil-sentence-bert-nli
• ai4bharat/indic-bert
• paraphrase-multilingual-mpnet-base-v2

Pipeline:

• Document → embedding
• Candidate n-grams generation
• Semantic similarity ranking
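
The three pipeline steps can be sketched without any model by plugging in a toy embedding. Here a bag-of-words counter stands in for the sentence-embedding model, so the ranking reflects word overlap only; with a Tamil-capable encoder in place of `toy_embed` the same flow would rank by semantic similarity. All names are hypothetical and this is not KeyBERT's own code.

```python
import math
from collections import Counter
from typing import Callable, FrozenSet, List, Tuple


def candidate_ngrams(tokens: List[str], max_n: int = 2) -> List[str]:
    """Unique 1..max_n-grams in order of first appearance."""
    seen: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate not in seen:
                seen.append(candidate)
    return seen


def toy_embed(text: str) -> Counter:
    return Counter(text.split())  # stand-in for a sentence-embedding model


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_keywords(
    doc: str,
    embed: Callable[[str], Counter] = toy_embed,
    stopwords: FrozenSet[str] = frozenset(),
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    tokens = [t for t in doc.split() if t not in stopwords]
    doc_vector = embed(" ".join(tokens))
    scored = [(c, cosine(embed(c), doc_vector)) for c in candidate_ngrams(tokens)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


keywords = extract_keywords("tamil nlp tamil search", top_n=2)
```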

🛑 System-Level Challenges We Solved

• Tamil stopword handling (custom lists)
• Text normalization (spelling variations, diacritics)
• Tokenization consistency
• Handling low-resource language constraints

📊 What We Observed

• TF-IDF → strong for global topic words
• YAKE → reliable lightweight baseline
• KeyBERT + Tamil embeddings → best semantic keyword quality

💡 Key Insight

In Tamil NLP, keyword extraction is not limited by the algorithm —it is constrained by:

• Embedding quality
• Tokenization
• Text normalization

🚀 Outcome

We built a production-ready Tamil keyword extraction pipeline that:
• Produces semantically meaningful keywords
• Works across different document types
• Integrates seamlessly into downstream RAG systems

🔬 This work is part of ongoing Noolaham GPT development at CTNLPR.

15/04/2026

⚡ 𝐃𝐞𝐬𝐢𝐠𝐧𝐢𝐧𝐠 𝐚𝐧 𝐄𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 𝐑𝐞𝐫𝐚𝐧𝐤𝐢𝐧𝐠 𝐋𝐚𝐲𝐞𝐫: 𝐌𝐮𝐥𝐭𝐢𝐥𝐢𝐧𝐠𝐮𝐚𝐥 𝐂𝐫𝐨𝐬𝐬-𝐄𝐧𝐜𝐨𝐝𝐞𝐫 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧 𝐟𝐨𝐫 𝐓𝐚𝐦𝐢𝐥–𝐄𝐧𝐠𝐥𝐢𝐬𝐡 𝐑𝐀𝐆

In multilingual RAG systems, dense retrieval can surface relevant chunks, but retrieval alone is not sufficient.
Not all retrieved passages are equally relevant, and passing all candidates directly to the LLM leads to:

• Increased token usage
• Higher latency
• Noisy context → degraded response quality

🔍 Problem

Dense retrievers often fail at ranking precision, especially for mixed-language queries (Tamil, English).

This results in:

• Relevant documents ranked lower
• Cross-lingual inconsistencies
• Reduced downstream LLM answer quality

⚙️ Core Approach

At CTNLPR, we introduce a cross-encoder reranking layer to refine retrieval results.

Unlike bi-encoders, rerankers:

• Jointly encode query–document pairs
• Capture fine-grained semantic relevance
• Improve cross-lingual ranking consistency

👉 This enables accurate ordering of multilingual candidates before generation.

🔬 Model Evaluation

We evaluated multiple multilingual rerankers:

• BGE-v2-m3 → high accuracy, higher latency on CPU
• jina-v3-multi → strong cross-lingual consistency
• jina-v2-cpu-opt → best latency–quality trade-off
• gte-multilingual → stable performance

Without reranking, we observed:

• Correct documents retrieved but mis-ranked
• Ranking instability for mixed-language queries
• Noise introduced by lexical fusion methods (e.g., RRF)

🧩 Reranking Pipeline

We adopt a two-stage architecture:

1. Retrieve Top-K candidates (dense retrieval)
2. Apply cross-encoder reranker
3. Score and reorder candidates
4. Pass Top-N results to LLM
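
The four steps above can be sketched with pluggable scoring functions. The two scorers below are toy stand-ins (token overlap for the dense retriever, Jaccard similarity for the cross-encoder), chosen only so the example runs without models.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    top_k: int = 20,
    top_n: int = 5,
) -> List[str]:
    # Stage 1: cheap scoring over the whole corpus, keep Top-K candidates.
    candidates = sorted(corpus, key=lambda d: retriever_score(query, d), reverse=True)[:top_k]
    # Stage 2: expensive scoring over the candidates only, keep Top-N.
    reranked = sorted(candidates, key=lambda d: reranker_score(query, d), reverse=True)
    return reranked[:top_n]


def overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


def jaccard(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d) if q | d else 0.0


corpus = [
    "tamil library history archive of many long unrelated words",
    "tamil library history",
    "cricket scores",
]
results = retrieve_then_rerank("tamil library", corpus, overlap, jaccard, top_k=2, top_n=1)
```

The reranker is called only `top_k` times per query, which is why shrinking the candidate set has such a direct effect on latency.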

⚡ CPU Optimization Strategy

Cross-encoders are computationally expensive, especially in CPU-only environments.
Our objective: maximize ranking quality under strict latency constraints.

1️⃣ Candidate Reduction (High Impact)
• Reduce Top-K before reranking (e.g., 100 → 20)
• Directly minimizes forward passes

💡 Largest performance gain comes from reducing reranker calls

2️⃣ ONNX + INT8 Quantization

• Convert PyTorch → ONNX
• Apply INT8 dynamic quantization

Benefits:
• Faster inference
• Lower memory usage
• Minimal impact on ranking quality

3️⃣ Token & Runtime Optimization

• Reduce max token length (512 → 256)
• Optimize CPU threading (OMP / MKL)
• Use efficient tokenization + batching

💡 Self-attention scales as O(n²), making token reduction critical
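
A back-of-the-envelope estimate of how the two reductions combine, assuming the attention cost of one forward pass grows with the square of the sequence length and that reranking cost is proportional to the number of candidates. This ignores the feed-forward layers, which scale linearly, so real speedups will be smaller.

```python
def relative_attention_cost(
    num_candidates: int,
    seq_len: int,
    base_candidates: int = 100,
    base_seq_len: int = 512,
) -> float:
    """Attention cost relative to the baseline, modelled as candidates * seq_len**2."""
    return (num_candidates * seq_len ** 2) / (base_candidates * base_seq_len ** 2)


fewer_candidates = relative_attention_cost(20, 512)   # 100 -> 20 candidates
both_reductions = relative_attention_cost(20, 256)    # plus 512 -> 256 tokens
print(f"{fewer_candidates:.2f} {both_reductions:.2f}")
```

Under this simplified model, cutting candidates from 100 to 20 leaves 20% of the attention cost, and halving the token length on top of that leaves 5%.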

📊 Performance Signals

• Latency reduced from seconds → sub-second range (~100× improvement)
• Maintained strong ranking quality (MRR / nDCG)
• Stable cross-lingual ranking (Tamil ↔ English)

What Didn’t Work

• Similarity threshold filtering → unstable across scripts
• RRF (Reciprocal Rank Fusion) → introduces lexical noise
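
For reference, Reciprocal Rank Fusion in its usual formulation scores each document by summing 1 / (k + rank) over the input rankings, with k = 60 as the commonly used constant. The sketch below is the generic method, not the CTNLPR configuration, and the document ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one by summed reciprocal ranks."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]
lexical = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense, lexical])
```

Because only ranks are used, a document that tops the lexical list rises in the fused list regardless of how weak its semantic match is, which is one way the lexical noise mentioned above can enter.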

💡 Key Insight

Multilingual RAG is not just a retrieval problem —
it is a ranking precision problem.

• Retrieval → ensures coverage
• Reranking → ensures correctness

🚀 Outcome

• Improved ranking accuracy across languages
• Reduced CPU latency to production-ready levels
• Efficient, scalable multilingual pipeline
• Better handling of mixed-language queries

Multilingual RAG becomes reliable when retrieval and reranking are jointly optimized.

At CTNLPR, we designed and deployed this reranking layer as part of our Tamil–English RAG pipeline, focusing on CPU-efficient cross-lingual ranking for real-world, large-scale document systems.

09/04/2026

⚡️𝐃𝐞𝐬𝐢𝐠𝐧𝐢𝐧𝐠 𝐚 𝐁𝐢𝐥𝐢𝐧𝐠𝐮𝐚𝐥 𝐑𝐀𝐆 𝐒𝐲𝐬𝐭𝐞𝐦: 𝐂𝐫𝐨𝐬𝐬-𝐋𝐢𝐧𝐠𝐮𝐚𝐥 𝐃𝐞𝐧𝐬𝐞 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐟𝐨𝐫 𝐓𝐚𝐦𝐢𝐥–𝐄𝐧𝐠𝐥𝐢𝐬𝐡

In multilingual RAG systems, the key challenge is cross-lingual retrieval — enabling a query in Tamil to retrieve semantically relevant Tamil and English passages from a unified index (and vice versa), without translation pipelines or language-specific partitioning.

⚙️ Core Approach

We rely on multilingual dense encoders that project Tamil and English into a shared semantic vector space, allowing semantically aligned content across languages to be retrieved using standard similarity search.

🔬 Model Evaluation

We evaluated:
• Sentence Transformers (SBERT variants)
• Indic-specific models (IndicBERT, MuRIL)

Observed limitations:
• Weak Tamil–English alignment
• Inconsistent cross-lingual similarity distributions
• Lower recall in mixed-language retrieval

✅ Selected Model

→ intfloat/multilingual-e5-large

Reasons:
• Built on XLM-RoBERTa-large (multilingual pretraining)
• Trained with large-scale contrastive objectives (>1B pairs)
• Fine-tuned on retrieval benchmarks (MS MARCO, Mr. TyDi, MIRACL)
• Instruction-aware embedding (“query:” / “passage:” prefixes)

This results in strong cross-lingual ranking and alignment, especially for low-resource languages.

🧩 Indexing Strategy

We use a unified embedding + single index design:
• Chunk all documents (Tamil + English)
• Encode using the same model
• Store in one vector index
No language-based partitioning.

🔎 Retrieval Flow

1. Encode query (Tamil or English)
2. Perform ANN search (cosine similarity)
3. Retrieve top-k cross-lingual chunks
4. Pass to LLM for response synthesis
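
The indexing and retrieval flow can be sketched as below. The "query: " and "passage: " prefixes follow the convention the E5 models expect; everything else is a stand-in: `toy_embed` replaces the encoder with word counts, so it cannot match across languages, and the brute-force scan replaces a real ANN index.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Embedder = Callable[[str], Counter]


def toy_embed(text: str) -> Counter:
    return Counter(text.split())  # stand-in for a multilingual dense encoder


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: List[str], embed: Embedder) -> List[Tuple[str, Counter]]:
    # One index for all languages; every chunk gets the passage prefix.
    return [(chunk, embed("passage: " + chunk)) for chunk in chunks]


def search(query: str, index: List[Tuple[str, Counter]], embed: Embedder, top_k: int = 3) -> List[str]:
    query_vector = embed("query: " + query)
    scored = [(cosine(query_vector, vector), chunk) for chunk, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


chunks = ["jaffna public library", "யாழ்ப்பாணம் நூலகம்", "cricket match report"]
index = build_index(chunks, toy_embed)
hits = search("jaffna library", index, toy_embed, top_k=1)
```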

📊 Benchmark Signals (MRR / nDCG)

Across multilingual benchmarks and internal evaluations:
• MRR@10 ↑ → better early precision in cross-lingual retrieval
• nDCG@10 ↑ → improved ranking quality for mixed-language queries
• Recall@10 ↑ → higher retrieval coverage (Tamil ↔ English)
• More stable cosine similarity distributions across scripts

These gains are primarily driven by large-scale contrastive training + retrieval-specific fine-tuning.
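
For readers less familiar with these metrics, the following sketch computes them for a single query with binary relevance. MRR@10 is then the mean of the reciprocal rank over all queries. The document ids are made up.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Discounted gain of the ranking divided by that of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d2"]
relevant = {"d1", "d2"}
rr = reciprocal_rank(ranked, relevant)
recall = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```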

💡 Key Insight

Cross-lingual RAG is not a database problem; it is an embedding alignment problem solved at training time.

🚀 Outcome

• Stronger cross-lingual ranking (Mean Reciprocal Rank/nDCG improvements)
• No translation overhead
• Single index, reduced system complexity
• Better knowledge coverage across languages

Multilingual retrieval becomes reliable when both languages share the same semantic space.

Location

Jaffna

Telephone

+442037733854

Website

http://www.ctnlpr.com/

Address

63, Sir Pon, Thirunelvelly, Ramanathan Road, Kallady
Jaffna
40000

Opening Hours

Monday	09:00 - 17:00
Tuesday	09:00 - 17:00
Wednesday	09:00 - 17:00
Thursday	09:00 - 17:00
Friday	09:00 - 17:00