15/06/2026
Building a Tamil Coreference Resolution System for Information Extraction
Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not explain how references to those entities evolve throughout a document.
This is where Coreference Resolution (CR) becomes essential.
Coreference Resolution is the task of determining whether multiple mentions within a document refer to the same real-world entity. It acts as a critical bridge between entity recognition and structured knowledge extraction.
Applications:
• Knowledge Graph Construction
• Relation Extraction
• Semantic Search
• Document Intelligence
• Retrieval-Augmented Generation (RAG)
• Conversational AI
Accurate coreference resolution is often the difference between fragmented information and coherent knowledge.
The Information Extraction Problem
During our work at CTNLPR, we observed a common challenge in Tamil document processing.
Documents rarely repeat the full entity name in every sentence. Instead, they rely on:
• Pronouns
• Possessive references
• Descriptive noun phrases
• Location references
Humans resolve these references naturally using context. Machines do not.
When coreference resolution is applied, the extracted knowledge becomes meaningful and directly usable within downstream systems.
Why We Did Not Start with Neural Coreference Models
Most modern coreference systems rely on:
• Transformer-based architectures
• Mention-ranking models
• End-to-end neural systems
• Span-ranking approaches
While highly effective for English and other high-resource languages, they typically require:
• Large annotated datasets
• Extensive model training
• Significant computational resources
• Language-specific supervision
Tamil currently lacks large-scale publicly available coreference corpora.
Instead of waiting for benchmark datasets, we explored a different direction:
Build a deterministic, explainable, and task-oriented coreference framework optimized for Information Extraction.
Our goal was not to compete with neural benchmarks.
Our goal was to improve relation extraction quality in real-world Tamil document processing.
System Architecture
Tamil Document
↓
Text Normalization
↓
Sentence Segmentation
↓
Mention Detection
↓
Entity Normalization
↓
Entity Memory
↓
Coreference Resolution
↓
Coreference Chain Construction
↓
Visualization Layer
Each layer contributes toward discourse-level entity understanding.
Mention Detection
The mention detection layer combines multiple strategies:
• Named Entity Recognition (PERSON, LOCATION, ORGANIZATION)
• Pronoun Detection
• Location Reference Detection
• Rule-Based Noun Phrase Detection
These mentions become candidates for resolution.
Entity Normalization
Tamil's rich morphology creates multiple surface forms for the same entity.
Example:
──────────────────
யாழ்ப்பாணம்
யாழ்ப்பாணத்தில்
யாழ்ப்பாணத்தின்
யாழ்ப்பாணத்திற்கு
──────────────────
Using Stanza lemmatization:
──────────────────
யாழ்ப்பாணத்தில்
↓
யாழ்ப்பாணம்
──────────────────
This reduces entity fragmentation and improves linking consistency.
The Entity Memory Layer
One of the key design decisions was introducing a lightweight discourse memory.
Instead of neural antecedent scoring, the system maintains contextual entity state:
last_person
last_location
last_org
Whenever a new entity is detected, the corresponding memory state is updated.
This memory acts as the document's discourse context.
Coreference Resolution
Once discourse memory has been established, the resolver links newly encountered mentions to previously observed canonical entities.
The system performs:
• Person Pronoun Resolution
• Possessive Resolution
• Location Resolution
• Rule-Based Noun Phrase Resolution
By maintaining discourse state across sentence boundaries, fragmented references are transformed into consistent entity representations.
This significantly improves downstream relation extraction quality.
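As a minimal sketch of how such a discourse memory and resolver could fit together: the slot names `last_person`, `last_location`, and `last_org` come from the description above, while the class name, the method names, and the tiny pronoun tables are assumptions made for illustration and are not the actual CTNLPR implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny reference tables for illustration only.
PERSON_PRONOUNS = {"அவர்", "இவர்"}
LOCATION_REFERENCES = {"அங்கு"}


@dataclass
class EntityMemory:
    """Holds the most recently seen canonical entity of each type."""
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None

    def update(self, entity_type: str, canonical: str) -> None:
        """Overwrite the slot that matches the detected entity type."""
        if entity_type == "PERSON":
            self.last_person = canonical
        elif entity_type == "LOCATION":
            self.last_location = canonical
        elif entity_type == "ORGANIZATION":
            self.last_org = canonical

    def resolve(self, mention: str) -> Optional[str]:
        """Map a pronoun-like mention to the remembered entity, if any."""
        if mention in PERSON_PRONOUNS:
            return self.last_person
        if mention in LOCATION_REFERENCES:
            return self.last_location
        return None  # not a reference this sketch knows how to resolve


memory = EntityMemory()
memory.update("PERSON", "சபாபதி")
memory.update("LOCATION", "யாழ்ப்பாணம்")
person = memory.resolve("அவர்")   # looked up through last_person
place = memory.resolve("அங்கு")   # looked up through last_location
```

Keeping a single slot per entity type makes every decision easy to trace, at the cost of picking the wrong antecedent when two people are discussed in alternation.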
Coreference Chain Construction
Rather than replacing mentions individually, the system groups related mentions into entity clusters.
Consider this example:
சாம்.ஏ.சபாபதி ஒரு பிரபல சமூக சேவகர். அவர் யாழ்ப்பாணத்தில் பல கல்வித் திட்டங்களை முன்னெடுத்தார். இந்த சமூகசேவகர் பல விருதுகள் பெற்றுள்ளார். அவர் யாழ்ப்பாணத்தில் பிறந்தார். அங்கு அவருக்கு பெரும் மதிப்பு இருந்தது. இவர் யாழ்ப்பாண நூலகத்தின் உருவாக்கத்தில் முக்கிய பங்காற்றினார்.
──────────────────
Entity: சாம்.ஏ.சபாபதி
├── சாம்.ஏ.சபாபதி
├── அவர்
├── இந்த சமூகசேவகர்
├── அவர்
└── இவர்
Entity: யாழ்ப்பாணம்
├── யாழ்ப்பாணம்
└── அங்கு
──────────────────
These chains provide a document-level view of entity references.
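One way to picture the grouping step is the short sketch below. It assumes the resolver has already produced `(surface mention, canonical entity)` pairs; the function names and the tree rendering are illustrative choices, not the project's code.

```python
from typing import Dict, Iterable, List, Tuple


def build_chains(resolved: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group surface mentions under their canonical entity, in document order."""
    chains: Dict[str, List[str]] = {}
    for surface, canonical in resolved:
        chains.setdefault(canonical, []).append(surface)
    return chains


def render_chains(chains: Dict[str, List[str]]) -> str:
    """Render each chain as a small text tree."""
    lines: List[str] = []
    for entity, mentions in chains.items():
        lines.append(f"Entity: {entity}")
        for i, mention in enumerate(mentions):
            branch = "└──" if i == len(mentions) - 1 else "├──"
            lines.append(f"{branch} {mention}")
    return "\n".join(lines)


resolved_mentions = [
    ("சபாபதி", "சபாபதி"),
    ("அவர்", "சபாபதி"),
    ("யாழ்ப்பாணம்", "யாழ்ப்பாணம்"),
    ("அங்கு", "யாழ்ப்பாணம்"),
]
chains = build_chains(resolved_mentions)
tree = render_chains(chains)
```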
Useful for:
• Debugging
• Evaluation
• Knowledge Graph Construction
• Relation Extraction
Visualization & Analysis
To support experimentation and validation, we developed a Streamlit-based visualization layer.
Users can:
• Submit Tamil documents
• Inspect generated coreference chains
• Analyze entity clusters
• Validate resolution decisions
This provides transparency into the resolution process and helps identify weaknesses in rule design.
Key Insight
Coreference Resolution is not merely a pronoun-resolution task.
It is an entity consistency layer that connects:
• Named Entity Recognition
• Relation Extraction
• Knowledge Graph Construction
• Semantic Search
• RAG Systems
Without coreference:
──────────────────
(இவர், பங்காற்றினார், யாழ்ப்பாண நூலகம்)
──────────────────
With coreference:
──────────────────
(சாம்.ஏ.சபாபதி, பங்காற்றினார், யாழ்ப்பாண நூலகம்)
──────────────────
The second representation is immediately usable within structured knowledge systems.
System Implementation at CTNLPR
Current capabilities include:
✅ Named Entity Recognition
✅ Entity Normalization
✅ Pronoun Resolution
✅ Possessive Resolution
✅ Location Resolution
✅ Rule-Based Noun Phrase Resolution
✅ Coreference Chain Construction
✅ Streamlit-Based Visualization
The system acts as a foundational layer between entity extraction and knowledge graph generation.
Future Work
• Multi-entity discourse memory
• Entity salience tracking
• Advanced noun phrase resolution
• Relation-aware coreference resolution
• Knowledge graph integration
• Hybrid neural-rule architectures
Conclusion
Building effective Tamil Information Extraction systems requires more than Named Entity Recognition.
By introducing a dedicated coreference resolution layer, we can maintain entity consistency across documents, improve relation extraction quality, and generate more reliable structured knowledge.
For low-resource languages such as Tamil, carefully designed rule-based systems remain a practical and effective pathway toward document-level semantic understanding while larger neural approaches continue to mature.
03/06/2026
⚡️ Fine-Tuning IndicNER for Sri Lankan Tamil Named Entity Recognition
Transformer-based multilingual NLP systems have significantly improved Named Entity Recognition (NER) across many languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation.
At CTNLPR, we fine-tuned ai4bharat/IndicNER specifically for Sri Lankan Tamil using a custom annotated NER corpus.
Objective
Improve entity recognition for:
• Sri Lankan Tamil linguistic patterns
• Local person, location, and organization names
• Morphology-aware contextual variations
Why Sri Lankan Tamil NER is Challenging
Most multilingual NER systems are trained primarily on:
• General web corpora
• Indian Tamil datasets
• Multilingual benchmark datasets
• Formal textual sources
When applied to Sri Lankan Tamil, they often struggle with:
• Regional naming conventions
• Local organization terminology
• Morphological suffix complexity
• OCR-induced token inconsistencies
• Subword tokenization fragmentation
• Ambiguous entity boundaries
These limitations directly affect downstream systems such as:
• Semantic Search
• Document Intelligence
• Knowledge Graph Construction
• Tamil Chatbots
• RAG Systems
• Government Document Processing
Model Fine-Tuning Overview
Base Model
→ ai4bharat/IndicNER
Entity Types
• PERSON
• LOCATION
• ORGANIZATION
Key Optimizations
✅ Tamil-safe Tokenization
✅ Unicode Normalization
✅ BIO Tagging
✅ Proper Subword Label Alignment
✅ Morphology-aware Training
✅ OCR-aware Preprocessing
Technical Challenges
1️⃣ Tamil Tokenization
Tamil is morphologically rich. Incorrect tokenization can cause:
• Broken entity spans
• Incorrect BIO labels
• Fragmented predictions
2️⃣ Subword Label Alignment
Transformer tokenizers frequently split Tamil words into multiple subword units.
Without proper alignment:
• Entity spans become corrupted
• BIO labels mismatch
• Training instability increases
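A common way to handle this, sketched here under assumptions rather than taken from our training code, is to label only the first subword of each word and mask the rest. The `word_ids` list has the shape returned by `word_ids()` on a Hugging Face fast tokenizer encoding (`None` for special tokens); the function name and the label ids are illustrative.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_labels: List[int],
                 word_ids: List[Optional[int]]) -> List[int]:
    """Expand word-level label ids to subword level.

    The first subword of a word keeps the word's label; continuation
    subwords and special tokens are masked with IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)           # continuation subword
        previous = word_id
    return aligned


# Three words labelled B-PER (1), O (0), B-LOC (3); the first word is split
# into three subwords and the last into two.
word_labels = [1, 0, 3]
word_ids = [None, 0, 0, 0, 1, 2, 2, None]
aligned = align_labels(word_labels, word_ids)
```

Masking continuation pieces keeps one training signal per word, so a long agglutinated form does not outweigh a short one in the loss.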
3️⃣ OCR Noise
Tamil OCR systems still generate:
• Grapheme inconsistencies
• Merged tokens
• Invalid Unicode combinations
• Punctuation corruption
Therefore OCR-aware normalization was integrated before training.
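The sketch below shows what a normalization pass of this kind might look like using only the Python standard library. The specific steps (NFC composition, stripping zero-width characters, dropping control characters, collapsing whitespace) are assumptions chosen for illustration; the real preprocessing rules are not spelled out in this post.

```python
import re
import unicodedata

# Zero-width characters that OCR output often carries (assumed list).
_ZERO_WIDTH = {ord(ch): None for ch in "\u200b\u200c\u200d\ufeff"}
_WHITESPACE = re.compile(r"\s+")


def normalize_ocr_text(text: str) -> str:
    """Clean OCR-extracted text before tokenization."""
    # NFC recombines two-part vowel signs that were emitted as separate marks.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_ZERO_WIDTH)
    # Drop control, format, and unassigned code points, but keep whitespace
    # so that it can be collapsed in the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    return _WHITESPACE.sub(" ", text).strip()


cleaned = normalize_ocr_text("  யாழ்ப்பாணம்\u200b   நூலகம் \n")
# Vowel sign E + vowel sign AA composes into the single vowel sign O.
composed = normalize_ocr_text("க\u0bc6\u0bbe")
```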
Model Evaluation
Overall Performance
• F1 Score → 0.650
• Precision → 0.602
• Recall → 0.707
• Accuracy → 96.04%
Entity-wise F1
• PERSON → 0.721
• LOCATION → 0.698
• ORGANIZATION → 0.484
PERSON and LOCATION categories achieved relatively strong performance, while ORGANIZATION entities remain the most challenging category.
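As a quick consistency check, F1 is the harmonic mean of precision and recall, so the overall score can be recomputed from the reported figures. The unweighted mean of the three entity-wise scores is also shown; it is a derived number for comparison, not a metric reported above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_score(0.602, 0.707)   # about 0.650, matching the reported F1

entity_f1 = {"PERSON": 0.721, "LOCATION": 0.698, "ORGANIZATION": 0.484}
macro_f1 = sum(entity_f1.values()) / len(entity_f1)   # about 0.634
```

The gap between the two numbers reflects how much the ORGANIZATION class pulls down an average that weights every entity type equally.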
Inference Examples
Example 1
Sentence:
"பாரதிதாசன் எழுதிய நூலை பாரதி பதிப்பகம் வெளியிட்டது."
Output:
👤 PERSON → பாரதிதாசன்
🏢 ORGANIZATION → பாரதி பதிப்பகம்
Example 2
Sentence:
"வடமராட்சி தொழில்நுட்ப நிறுவனம் மாணவர்களை சேர்த்தது."
Output:
🏢 ORGANIZATION → வடமராட்சி தொழில்நுட்ப நிறுவனம்
Example 3
Sentence:
"நவமணி கிராமம் வெள்ளத்தால் பாதிக்கப்பட்டது."
Output:
📍 LOCATION → நவமணி
Example 4
Sentence:
"āŽā¯.āŽ.āŽāޏā¯.āŽĒā¯. āŽāޝāŽāŽŋāŽā¯āŽ āŽ¨āŽĩāŽŽāŽŖāŽŋ āŽāŽŋāŽ°āŽžāŽŽāŽ¤ā¯āޤāŽŋāŽąā¯āŽā¯ āŽā¯āŽŠā¯āŽąāŽžāŽ°ā¯."
Output:
đ¤ PERSON â āŽā¯.āŽ.āŽāޏā¯.āŽĒā¯. āŽāޝāŽāŽŋāŽā¯āŽ
📍 LOCATION → நவமணி
Key Observation
One of the most important findings from this work is:
"Better preprocessing and domain-specific data can be as important as model architecture."
For low-resource languages like Sri Lankan Tamil:
• High-quality annotations matter
• OCR normalization matters
• Tokenizer alignment matters
• Linguistic preprocessing matters
Large transformer architectures alone are not sufficient without carefully prepared language-specific datasets.
Applications
• Tamil NER Systems
• Semantic Search
• RAG Pipelines
• OCR Information Extraction
• Knowledge Graph Construction
• Tamil Chatbots
This work is part of ongoing Tamil NLP research at CTNLPR aimed at building stronger NLP infrastructure for low-resource Tamil language technologies.
25/05/2026
Building a Sri Lankan Tamil Named Entity Recognition Dataset for Low-Resource NLP
The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies across major global languages. However, low-resource languages such as Sri Lankan Tamil still face a severe lack of high-quality annotated datasets, especially for foundational tasks like Named Entity Recognition (NER).
To address this gap, we developed the Srilankan-Tamil-NER Dataset, a Tamil NER dataset designed specifically for Sri Lankan Tamil linguistic and contextual usage.
This dataset is intended to support:
• Tamil NER research
• Indic language fine-tuning
• Information extraction systems
• Retrieval-Augmented Generation (RAG)
• Tamil LLM adaptation
• Domain-specific AI systems for Sri Lanka
Why Sri Lankan Tamil NER Matters
Named Entity Recognition (NER) is a core NLP task that identifies and classifies entities such as:
• Person names
• Locations
• Organizations
• Dates
• Miscellaneous entities
NER acts as a foundational layer for many downstream NLP systems including:
• Question answering
• Search systems
• Chatbots
• Document intelligence
• Machine translation
• Knowledge graph generation
For Tamil, particularly Sri Lankan Tamil, publicly available annotated corpora remain extremely limited. Existing multilingual datasets often underrepresent regional linguistic variations, local named entities, and culturally contextual terminology.
Most existing NER systems for Tamil are trained on datasets originating from Indian Tamil corpora, leaving significant gaps in handling:
• Sri Lankan Tamil vocabulary
• Local organization names
• Sri Lankan place names
• Government and institutional terminology
Our dataset aims to bridge this gap.
About the Dataset
Dataset Name:
Srilankan-Tamil-NER Dataset
The primary goal of this dataset is to create a high-quality manually curated Named Entity Recognition corpus for Sri Lankan Tamil under CTNLPR.
The dataset is structured to support fine-tuning transformer-based multilingual models such as:
• IndicNER
• mBERT
• XLM-RoBERTa
• MuRIL
• IndicBERT
Dataset Statistics
• B-PER (Person): 4,533
• B-LOC (Location): 8,110
• B-ORG (Organization): 3,369
• Total Entities: 16,012
Dataset Preparation Pipeline
Creating a Tamil NER dataset involves significantly more than simple annotation.
The preparation workflow included multiple stages:
1. Data Collection
The raw Tamil text corpus was collected from the Noolaham corpus and other relevant publicly available Sri Lankan Tamil textual sources.
Special attention was given to:
• Local linguistic relevance
• Entity diversity
• Sentence quality
• Contextual richness
The objective was to capture realistic Sri Lankan Tamil usage patterns rather than synthetic or translated text.
2. OCR and Text Normalization
Tamil NLP pipelines often begin with scanned or image-based documents.
As part of our broader Tamil document intelligence workflow, OCR-extracted Tamil text underwent:
• Unicode normalization
• Punctuation cleaning
• Whitespace normalization
• Invalid character filtering
• OCR noise reduction
OCR-related preprocessing becomes extremely important because Tamil script errors can propagate heavily into token classification systems.
3. Named Entity Annotation
The dataset was manually annotated using BIO tagging format.
Entity Types:
• B-PER → Beginning of person entity
• I-PER → Inside person entity
• B-LOC → Beginning of location entity
• I-LOC → Inside location entity
• B-ORG → Beginning of organization entity
• I-ORG → Inside organization entity
• O → Non-entity token
Example:
இராமநாதன் → B-PER
யாழ்ப்பாணம் → B-LOC
பல்கலைக்கழகம் → B-ORG
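Because each entity opens with exactly one `B-` tag, per-type entity counts such as the dataset statistics can be obtained by counting `B-` tags. The helper names below are illustrative, and the validity check is one simple convention for BIO sequences rather than the annotation tool's own rule set.

```python
from collections import Counter
from typing import Iterable, List


def count_entities(tags: Iterable[str]) -> Counter:
    """Count entities per type by counting their B- tags."""
    return Counter(tag[2:] for tag in tags if tag.startswith("B-"))


def is_valid_bio(tags: List[str]) -> bool:
    """An I-X tag must directly follow a B-X or I-X tag of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            if previous not in ("B-" + tag[2:], "I-" + tag[2:]):
                return False
        previous = tag
    return True


tags = ["B-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
counts = count_entities(tags)

# The published per-type counts add up to the published total.
reported = {"PER": 4533, "LOC": 8110, "ORG": 3369}
reported_total = sum(reported.values())
```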
Challenges in Sri Lankan Tamil NER
Building a Tamil NER dataset introduced several language-specific challenges.
• Morphological complexity
• OCR noise
• Unicode inconsistencies
• Token boundary detection
• Subword alignment
• Limited benchmark corpora
Fine-Tuning Use Cases
This dataset can support:
• Tamil NER
• OCR post-processing
• Semantic search systems
• RAG pipelines
• Tamil chatbots
• Government document AI
• Knowledge graph generation
The Srilankan-Tamil-NER Dataset, developed under CTNLPR, represents an important step toward strengthening the Sri Lankan Tamil NLP ecosystem through high-quality entity annotation and linguistically relevant corpus preparation.
15/05/2026
⚡️ Building a Tamil Coreference Resolution System: Zero-Shot Contextual Modeling with MuRIL
Coreference Resolution (CR) is a critical NLP task for identifying whether multiple mentions in a document refer to the same real-world entity. It plays a major role in:
• Knowledge Graph Construction
• Relation Extraction
• Semantic Search
• RAG Systems
• Conversational AI
• Document-Level Understanding
While English NLP already has mature coreference systems and libraries, Tamil remains a highly challenging low-resource language for discourse-level semantic modeling.
At CTNLPR, we explored how modern multilingual coreference architectures can be adapted for Tamil using zero-shot contextual semantic modeling instead of heavily supervised pipelines.
Tamil introduces several difficult linguistic challenges:
• Agglutinative morphology
• Free word order
• Pronoun dropping
• Implicit subject references
• Rich inflectional structures
• Long-distance discourse dependencies
• Noun-to-noun semantic references
Additionally:
• No dedicated Tamil coreference libraries currently exist publicly
• Large annotated Tamil CR datasets are unavailable
• Most multilingual systems remain heavily English-biased
Our Architecture
The architecture currently being explored at CTNLPR uses:
• MuRIL-based contextual embeddings
• Span-based mention detection
• Contextual span representations
• Cosine similarity-based semantic linking
• Agglomerative clustering
Instead of manually defining antecedents, the system automatically generates semantic mention spans from Tamil text and groups semantically related mentions into discourse-level entity chains using contextual similarity.
🧠 Key Technical Direction
Traditional supervised coreference systems depend heavily on:
• Large annotated corpora
• Expensive training pipelines
• Language-specific supervision
• Antecedent ranking architectures
• High computational cost
For Tamil, these resources are extremely limited.
Our approach avoids heavy annotation dependency while still leveraging multilingual transformer-based semantic understanding learned from Indian-language pretraining.
Each candidate span is encoded using contextual embeddings generated from MuRIL, and span-level semantic representations are constructed using contextual token pooling. The system then performs:
• Heuristic span pruning
• Semantic similarity computation
• Similarity-driven clustering
to generate discourse-level coreference chains.
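The pooling, similarity, and clustering steps can be sketched in pure Python as below. The vectors are toy values standing in for MuRIL token embeddings, and mean pooling, average linkage, and the 0.8 threshold are all assumptions made for this sketch; the post does not state which pooling or linkage variant is actually used.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(token_vectors: Sequence[Vector]) -> List[float]:
    """Average token vectors into one span representation."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors)
            for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_spans(span_vectors: Sequence[Vector],
                  threshold: float = 0.8) -> List[List[int]]:
    """Average-linkage agglomerative clustering on cosine similarity.

    Repeatedly merges the two most similar clusters until the best
    available similarity drops below the threshold.
    """
    clusters = [[i] for i in range(len(span_vectors))]

    def linkage(c1: List[int], c2: List[int]) -> float:
        sims = [cosine(span_vectors[i], span_vectors[j])
                for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage(clusters[a], clusters[b])
                if score > best:
                    best, pair = score, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


span_vec = mean_pool([[1.0, 3.0], [3.0, 5.0]])
chains = cluster_spans([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

Each returned list holds the indices of spans assigned to one coreference chain; the threshold controls how aggressively mentions are merged.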
Key Advantages
• Zero-shot inference
• Low-resource scalability
• Context-aware semantic reasoning
• Better adaptation to Tamil morphology
• Lightweight unsupervised inference
• Reduced annotation dependency
This architecture is being explored at CTNLPR as a foundation for:
• Tamil discourse understanding
• Entity-aware semantic linking
• Knowledge Graph Construction
• Ontology-aware NLP
• Multilingual semantic reasoning
• Advanced RAG systems
Building document-level semantic understanding for Tamil is one of the next major steps toward scalable low-resource AI systems.
04/05/2026
⚡ Building a Morphology-Aware Tamil NER Pipeline: From Transformer Extraction to Canonical Entity Resolution
Named Entity Recognition (NER) is a critical layer in our Tamil NLP stack (search, indexing, knowledge graph construction, and RAG).
However, for Tamil, extracting entities is only half the problem; canonicalizing them is the real challenge.
What We Built in CTNLPR
We designed a Tamil-aware NER pipeline by extending a transformer-based model with morphological normalization:
• ai4bharat/IndicNER → baseline entity extraction
• Custom span merging → IOB consolidation
• Prefix-based grouping → variant clustering
• Morphological normalization layer → canonical entity resolution
Since Indic NER models are not morphology-aware, we explicitly evaluated and integrated normalization strategies.
Method Exploration (What We Tried)
• IndicNER (Transformer baseline)
✅ Strong recall across entity types
❌ Produces multiple inflected variants of the same entity
• Prefix-based grouping
✅ Fast heuristic clustering
❌ Not linguistically grounded
• UoM Thamizhi Morphological Normalizer (University of Moratuwa)
✅ Linguistically motivated rule-based approach
❌ Limited effectiveness on real-world data
❌ Struggled with:
* Noisy OCR text
* Complex suffix chains
* Unseen word forms
• Tamil Lemmatizer (final approach)
✅ Consistent root-form extraction
✅ Robust across inflected variants
✅ Best empirical performance in our pipeline
Key Design Decision
Transformer models do not enforce canonical forms.
Surface forms like:
• இலங்கையில்
• இலங்கையிலும்
âĸ āŽāޞāŽā¯āŽā¯āޝāŽŋāŽ˛ā¯
are extracted as separate entities
After normalization:
• இலங்கை
This enables many-to-one mapping, critical for system consistency.
🧩 Our Setup
Pipeline:
• Document → chunking
• Transformer inference (IndicNER)
• IOB span merging + filtering
• Variant aggregation (prefix-based)
• Morphological normalization (UoM explored → Lemmatizer selected)
• Entity re-indexing
Example:
இலங்கையில் → இலங்கை
இலங்கையிலும் → இலங்கை
இந்தியாவில் → இந்தியா
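The span-merging and many-to-one mapping steps might look roughly like this. The lemmatizer here is a lookup table built from the three examples above, standing in for whichever Tamil lemmatizer is plugged in; the function names are illustrative, not the pipeline's real API.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[str, str]  # (surface text, entity type)


def merge_iob(tokens: List[str], tags: List[str]) -> List[Span]:
    """Consolidate token-level IOB tags into entity spans."""
    spans: List[Span] = []
    current: List[str] = []
    current_type = ""
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == current_type and current
        if continues:
            current.append(token)
            continue
        if current:                        # close the span in progress
            spans.append((" ".join(current), current_type))
            current, current_type = [], ""
        if tag != "O":                     # B-X, or a stray I-X, opens a span
            current, current_type = [token], tag[2:]
    if current:
        spans.append((" ".join(current), current_type))
    return spans


def group_by_canonical(spans: List[Span],
                       lemmatize: Callable[[str], str]
                       ) -> Dict[Span, Set[str]]:
    """Map every surface variant onto its canonical (lemma, type) key."""
    groups: Dict[Span, Set[str]] = defaultdict(set)
    for surface, entity_type in spans:
        groups[(lemmatize(surface), entity_type)].add(surface)
    return dict(groups)


# Stand-in lemmatizer: a lookup table instead of a real morphological model.
LEMMAS = {"இலங்கையில்": "இலங்கை", "இலங்கையிலும்": "இலங்கை",
          "இந்தியாவில்": "இந்தியா"}


def toy_lemmatize(surface: str) -> str:
    return LEMMAS.get(surface, surface)


tokens = ["இலங்கையில்", "x", "இலங்கையிலும்", "பாரதி", "பதிப்பகம்"]
tags = ["B-LOC", "O", "B-LOC", "B-ORG", "I-ORG"]
spans = merge_iob(tokens, tags)
groups = group_by_canonical(spans, toy_lemmatize)
```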
⚡️ System-Level Challenges We Solved
• Agglutinative suffix handling
• Variant explosion in entity outputs
• OCR/noisy input robustness
• Canonical entity consistency across documents
What We Observed
• Transformer NER → high recall, low canonical consistency
• UoM morphological normalizer → linguistically sound but limited robustness
• Lemmatizer → best normalization performance in practice
Final system:
IndicNER + Lemmatization (hybrid architecture)
✳️ Key Insight
In Tamil NER, the challenge is not detection;
it is morphological normalization.
NER output ≠ final entity
Canonicalization is essential for:
• Indexing
• Entity linking
• Knowledge graphs
• RAG systems
Outcome
We built a production-ready Tamil NER system that:
• Resolves inflected entity variants
• Produces stable canonical forms
• Improves downstream retrieval and analytics
• Scales across multi-document pipelines
This work is part of ongoing Tamil NLP system development at CTNLPR.
24/04/2026
Building a Keyword Extraction Pipeline for Tamil: From Statistical Methods to Embedding-Based Models
Keyword extraction is a core component in our NLP pipeline (search, indexing, and RAG). In practice, adapting existing methods for Tamil required careful system-level design, not just model selection.
What We Built
We implemented a Tamil-aware keyword extraction pipeline by adapting standard NLP libraries:
• scikit-learn → TF-IDF (statistical baseline)
• Gensim → TextRank (graph-based ranking)
• KeyBERT → embedding-based semantic extraction
Since these tools are not natively designed for Tamil, we integrated preprocessing using Indic NLP techniques (tokenization, normalization).
🧠 Method Evaluation (What Worked / What Didn't)
• TF-IDF
✅ Useful for corpus-level keyword distribution
❌ No semantic understanding
• TextRank
✅ Works without training
❌ Highly sensitive to tokenization quality
• YAKE
✅ Fast, strong baseline for per-document keywords
• KeyBERT (final approach)
✅ Captures semantic relevance
✅ Best performance for Tamil when paired with proper embeddings
Key Design Decision
KeyBERT itself is not language-aware; it depends entirely on the embedding model.
Using default English embeddings → poor Tamil results
Using Tamil/Indic embeddings → strong semantic extraction
🧩 Our Setup
We integrated KeyBERT with Tamil-capable embedding models:
• l3cube-pune/tamil-sentence-bert-nli
• ai4bharat/indic-bert
• paraphrase-multilingual-mpnet-base-v2
Pipeline:
• Document → embedding
• Candidate n-grams generation
• Semantic similarity ranking
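The three pipeline steps can be illustrated without any model at all. In this sketch `toy_embed` is a stand-in that counts words from a fixed vocabulary; in practice it would be replaced by one of the Tamil-capable sentence-embedding models listed above. The function names and the stopword handling are assumptions for illustration, not KeyBERT's internals.

```python
import math
from typing import Callable, FrozenSet, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_ngrams(tokens: List[str], max_n: int = 2,
                     stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """Generate unique 1..max_n-grams that do not start or end on a stopword."""
    seen, candidates = set(), []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in stopwords or gram[-1] in stopwords:
                continue
            phrase = " ".join(gram)
            if phrase not in seen:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates


def rank_keywords(document: str, candidates: List[str],
                  embed: Callable[[str], Vector],
                  top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank candidates by cosine similarity to the document embedding."""
    doc_vec = embed(document)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


VOCAB = ["jaffna", "library"]  # toy vocabulary for the stand-in embedder


def toy_embed(text: str) -> List[float]:
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]


candidates = candidate_ngrams("the jaffna library".split(), max_n=2,
                              stopwords=frozenset({"the"}))
ranked = rank_keywords("jaffna library jaffna", candidates, toy_embed, top_n=2)
```

Because ranking only ever sees `embed`, swapping the embedding model changes keyword quality without touching the rest of the pipeline.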
System-Level Challenges We Solved
• Tamil stopword handling (custom lists)
• Text normalization (spelling variations, diacritics)
• Tokenization consistency
• Handling low-resource language constraints
What We Observed
• TF-IDF → strong for global topic words
• YAKE → reliable lightweight baseline
• KeyBERT + Tamil embeddings → best semantic keyword quality
💡 Key Insight
In Tamil NLP, keyword extraction is not limited by the algorithm; it is constrained by:
• Embedding quality
• Tokenization
• Text normalization
Outcome
We built a production-ready Tamil keyword extraction pipeline that:
• Produces semantically meaningful keywords
• Works across different document types
• Integrates seamlessly into downstream RAG systems
This work is part of ongoing Noolaham GPT development at CTNLPR.
15/04/2026
⥠đđđŦđĸđ đ§đĸđ§đ đđ§ đđđđĸđđĸđđ§đ đđđĢđđ§đ¤đĸđ§đ đđđ˛đđĢ: đđŽđĨđđĸđĨđĸđ§đ đŽđđĨ đđĢđ¨đŦđŦ-đđ§đđ¨đđđĢ đđŠđđĸđĻđĸđŗđđđĸđ¨đ§ đđ¨đĢ đđđĻđĸđĨâđđ§đ đĨđĸđŦđĄ đđđ
In multilingual RAG systems, dense retrieval can surface relevant chunks, but retrieval alone is not sufficient.
Not all retrieved passages are equally relevant, and passing all candidates directly to the LLM leads to:
âĸ Increased token usage
âĸ Higher latency
âĸ Noisy context â degraded response quality
Problem
Dense retrievers often fail at ranking precision, especially for mixed-language queries (Tamil, English).
This results in:
• Relevant documents ranked lower
• Cross-lingual inconsistencies
• Reduced downstream LLM answer quality
Core Approach
At CTNLPR, we introduce a cross-encoder reranking layer to refine retrieval results.
Unlike bi-encoders, rerankers:
• Jointly encode query-document pairs
• Capture fine-grained semantic relevance
• Improve cross-lingual ranking consistency
This enables accurate ordering of multilingual candidates before generation.
Model Evaluation
We evaluated multiple multilingual rerankers:
• BGE-v2-m3 → high accuracy, higher latency on CPU
• jina-v3-multi → strong cross-lingual consistency
• jina-v2-cpu-opt → best latency-quality trade-off
• gte-multilingual → stable performance
Without reranking, we observed:
• Correct documents retrieved but mis-ranked
• Ranking instability for mixed-language queries
• Noise introduced by lexical fusion methods (e.g., RRF)
🧩 Reranking Pipeline
We adopt a two-stage architecture:
1. Retrieve Top-K candidates (dense retrieval)
2. Apply cross-encoder reranker
3. Score and reorder candidates
4. Pass Top-N results to LLM
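The four steps above can be sketched as a single function. Here `score_pair` stands for the cross-encoder's relevance score for one query-document pair; the word-overlap scorer is only a toy replacement so the example runs without a model, and the parameter names are assumptions.

```python
from typing import Callable, List, Sequence


def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           top_k: int = 20, top_n: int = 5) -> List[str]:
    """Score the first top_k retrieved candidates and keep the best top_n.

    Candidates are assumed to arrive in retrieval order; the scorer is
    called once per shortlisted candidate.
    """
    shortlist = list(candidates[:top_k])
    ordered = sorted(shortlist, key=lambda doc: score_pair(query, doc),
                     reverse=True)
    return ordered[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


retrieved = ["tamil history", "cooking recipes", "tamil library history"]
top = rerank("tamil library", retrieved, toy_overlap_score, top_k=3, top_n=2)
```

The number of scorer calls is bounded by `top_k`, which is why shrinking the shortlist is the most direct lever on reranking latency.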
⥠CPU Optimization Strategy
Cross-encoders are computationally expensive, especially in CPU-only environments.
Our objective: maximize ranking quality under strict latency constraints.
1ī¸âŖ Candidate Reduction (High Impact)
âĸ Reduce Top-K before reranking (e.g., 100 â 20)
âĸ Directly minimizes forward passes
đĄ Largest performance gain comes from reducing reranker calls
2ī¸âŖ ONNX + INT8 Quantization
âĸ Convert PyTorch â ONNX
âĸ Apply INT8 dynamic quantization
Benefits:
âĸ Faster inference
âĸ Lower memory usage
âĸ Minimal impact on ranking quality
3ī¸âŖ Token & Runtime Optimization
âĸ Reduce max token length (512 â 256)
âĸ Optimize CPU threading (OMP / MKL)
âĸ Use efficient tokenization + batching
đĄ Self-attention scales as O(n²), making token reduction critical
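A back-of-the-envelope estimate shows how the two reductions compound. This sketch counts only the quadratic attention term and assumes one forward pass per candidate; feed-forward layers scale linearly with length, so measured speedups will usually be smaller than this ratio suggests.

```python
def relative_attention_cost(num_candidates: int, seq_len: int,
                            base_candidates: int = 100,
                            base_seq_len: int = 512) -> float:
    """Attention-only cost relative to the baseline configuration.

    Cost is modelled as (number of forward passes) x (sequence length)^2.
    """
    passes = num_candidates / base_candidates
    attention = (seq_len / base_seq_len) ** 2
    return passes * attention


fewer_candidates = relative_attention_cost(20, 512)   # 0.2 of baseline
shorter_inputs = relative_attention_cost(100, 256)    # 0.25 of baseline
combined = relative_attention_cost(20, 256)           # 0.05 of baseline
```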
Performance Signals
• Latency reduced from seconds → sub-second range (~100× improvement)
• Maintained strong ranking quality (MRR / nDCG)
• Stable cross-lingual ranking (Tamil ↔ English)
What Didn't Work
• Similarity threshold filtering → unstable across scripts
• RRF (Reciprocal Rank Fusion) → introduces lexical noise
💡 Key Insight
Multilingual RAG is not just a retrieval problem;
it is a ranking precision problem.
• Retrieval → ensures coverage
• Reranking → ensures correctness
Outcome
• Improved ranking accuracy across languages
• Reduced CPU latency to production-ready levels
• Efficient, scalable multilingual pipeline
• Better handling of mixed-language queries
Multilingual RAG becomes reliable when retrieval and reranking are jointly optimized.
At CTNLPR, we designed and deployed this reranking layer as part of our Tamil-English RAG pipeline, focusing on CPU-efficient cross-lingual ranking for real-world, large-scale document systems.
09/04/2026
⚡️ Designing a Bilingual RAG System: Cross-Lingual Dense Retrieval for Tamil-English
In multilingual RAG systems, the key challenge is cross-lingual retrieval: enabling a query in Tamil to retrieve semantically relevant Tamil and English passages from a unified index (and vice versa), without translation pipelines or language-specific partitioning.
Core Approach
We rely on multilingual dense encoders that project Tamil and English into a shared semantic vector space, allowing semantically aligned content across languages to be retrieved using standard similarity search.
Model Evaluation
We evaluated:
• Sentence Transformers (SBERT variants)
• Indic-specific models (IndicBERT, MuRIL)
Observed limitations:
• Weak Tamil-English alignment
• Inconsistent cross-lingual similarity distributions
• Lower recall in mixed-language retrieval
✅ Selected Model
→ intfloat/multilingual-e5-large
Reasons:
• Built on XLM-RoBERTa-large (multilingual pretraining)
• Trained with large-scale contrastive objectives (>1B pairs)
• Fine-tuned on retrieval benchmarks (MS MARCO, Mr.TyDi, MIRACL)
• Instruction-aware embedding ("query:" / "passage:" prefixes)
This results in strong cross-lingual ranking and alignment, especially for low-resource languages.
🧩 Indexing Strategy
We use a unified embedding + single index design:
• Chunk all documents (Tamil + English)
• Encode using the same model
• Store in one vector index
No language-based partitioning.
Retrieval Flow
1. Encode query (Tamil or English)
2. Perform ANN search (cosine similarity)
3. Retrieve top-k cross-lingual chunks
4. Pass to LLM for response synthesis
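The single-index design and the retrieval flow can be mimicked end to end with a toy encoder. In this sketch `toy_encode` maps a few Tamil and English words to shared concept slots, which imitates what a multilingual encoder learns; a real system would call the embedding model instead and use an ANN index rather than the brute-force scan shown here. Only the "query: " and "passage: " prefixes come from the text above; every other name is an assumption.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

# Toy shared concept space: Tamil and English words point at the same slot.
CONCEPTS = {"library": 0, "நூலகம்": 0,
            "jaffna": 1, "யாழ்ப்பாணம்": 1,
            "flood": 2, "வெள்ளம்": 2}


def toy_encode(text: str) -> List[float]:
    """Stand-in encoder; unknown words (including prefixes) are ignored."""
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        if word in CONCEPTS:
            vec[CONCEPTS[word]] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(passages: Sequence[str],
                encode: Callable[[str], Vector]) -> List[Tuple[str, Vector]]:
    """One index for all languages; passages get the 'passage: ' prefix."""
    return [(p, encode("passage: " + p)) for p in passages]


def search(query: str, index: List[Tuple[str, Vector]],
           encode: Callable[[str], Vector], top_k: int = 3) -> List[str]:
    """Brute-force cosine search; queries get the 'query: ' prefix."""
    q_vec = encode("query: " + query)
    scored = sorted(index, key=lambda item: cosine(q_vec, item[1]),
                    reverse=True)
    return [passage for passage, _ in scored[:top_k]]


index = build_index(["jaffna library", "flood report", "யாழ்ப்பாணம் நூலகம்"],
                    toy_encode)
hits = search("நூலகம்", index, toy_encode, top_k=2)
```

A Tamil query reaches both the English and the Tamil passage because they land on the same vectors, which is the property the shared semantic space is meant to provide.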
Benchmark Signals (MRR / nDCG)
Across multilingual benchmarks and internal evaluations:
• MRR@10 ↑ → better early precision in cross-lingual retrieval
• nDCG@10 ↑ → improved ranking quality for mixed-language queries
• Recall@10 ↑ → higher retrieval coverage (Tamil ↔ English)
• More stable cosine similarity distributions across scripts
These gains are primarily driven by large-scale contrastive training + retrieval-specific fine-tuning.
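For reference, the three metrics can be computed per query as in the sketch below, using binary relevance. MRR is then the mean of the reciprocal ranks over all queries. The document ids are invented for the example and the function names are illustrative.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Share of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # hypothetical system output
relevant = {"d1", "d2"}             # hypothetical gold labels
rr = reciprocal_rank(ranked, relevant)
recall = recall_at_k(ranked, relevant, k=2)
ndcg = ndcg_at_k(ranked, relevant)
```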
💡 Key Insight
Cross-lingual RAG is not a database problem; it is an embedding alignment problem solved at training time.
Outcome
• Stronger cross-lingual ranking (Mean Reciprocal Rank/nDCG improvements)
• No translation overhead
• Single index, reduced system complexity
• Better knowledge coverage across languages
Multilingual retrieval becomes reliable when both languages share the same semantic space.