Center for Tamil Natural Language Processing Research

Contact information, map and directions, contact form, opening hours, services, ratings, photos, videos and announcements from Center for Tamil Natural Language Processing Research, Education, 63, Sir Pon, Thirunelvelly, Ramanathan Road, Kallady, Jaffna.

The Center for Tamil Natural Language Processing Research aims to research and develop natural language processing tools required for Tamil and to build an active scholarly network of people contributing to the advancement of the language.

15/06/2026

🚀𝗕𝘂đ—ļ𝗹𝗱đ—ļđ—ģ𝗴 𝗮 𝗧𝗮đ—ēđ—ļ𝗹 𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 đ—Ĩ𝗲𝘀đ—ŧ𝗹𝘂𝘁đ—ļđ—ŧđ—ģ đ—Ļ𝘆𝘀𝘁𝗲đ—ē đ—ŗđ—ŧđ—ŋ 𝗜đ—ģđ—ŗđ—ŧđ—ŋđ—ē𝗮𝘁đ—ļđ—ŧđ—ģ 𝗘𝘅𝘁đ—ŋ𝗮𝗰𝘁đ—ļđ—ŧđ—ģ

Modern Information Extraction systems rely on more than Named Entity Recognition (NER). While NER can identify entities such as people, locations, and organizations, it does not explain how references to those entities evolve throughout a document.

𝗧đ—ĩđ—ļ𝘀 đ—ļ𝘀 𝘄đ—ĩ𝗲đ—ŋ𝗲 𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 đ—Ĩ𝗲𝘀đ—ŧ𝗹𝘂𝘁đ—ļđ—ŧđ—ģ (𝗖đ—Ĩ) đ—¯đ—˛đ—°đ—ŧđ—ē𝗲𝘀 𝗲𝘀𝘀𝗲đ—ģ𝘁đ—ļ𝗮𝗹.

Coreference Resolution is the task of determining whether multiple mentions within a document refer to the same real-world entity. It acts as a critical bridge between entity recognition and structured knowledge extraction.

𝗔đ—Ŋđ—Ŋ𝗹đ—ļ𝗰𝗮𝘁đ—ļđ—ŧđ—ģ𝘀:

â€ĸ Knowledge Graph Construction
â€ĸ Relation Extraction
â€ĸ Semantic Search
â€ĸ Document Intelligence
â€ĸ Retrieval-Augmented Generation (RAG)
â€ĸ Conversational AI

Accurate coreference resolution is often the difference between fragmented information and coherent knowledge.

𝗧đ—ĩ𝗲 𝗜đ—ģđ—ŗđ—ŧđ—ŋđ—ē𝗮𝘁đ—ļđ—ŧđ—ģ 𝗘𝘅𝘁đ—ŋ𝗮𝗰𝘁đ—ļđ—ŧđ—ģ đ—Ŗđ—ŋđ—ŧđ—¯đ—šđ—˛đ—ē

During our work at CTNLPR, we observed a common challenge in Tamil document processing.

Documents rarely repeat the full entity name in every sentence. Instead, they rely on:

â€ĸ Pronouns
â€ĸ Possessive references
â€ĸ Descriptive noun phrases
â€ĸ Location references

Humans resolve these references naturally using context. Machines do not.

When coreference resolution is applied, the extracted knowledge becomes meaningful and directly usable within downstream systems.

đ—Ēđ—ĩ𝘆 đ—Ē𝗲 𝗗đ—ļ𝗱 𝗡đ—ŧ𝘁 đ—Ļ𝘁𝗮đ—ŋ𝘁 𝘄đ—ļ𝘁đ—ĩ 𝗡𝗲𝘂đ—ŋ𝗮𝗹 𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 𝗠đ—ŧ𝗱𝗲𝗹𝘀

Most modern coreference systems rely on:

â€ĸ Transformer-based architectures
â€ĸ Mention-ranking models
â€ĸ End-to-end neural systems
â€ĸ Span-ranking approaches

While highly effective for English and other high-resource languages, they typically require:

â€ĸ Large annotated datasets
â€ĸ Extensive model training
â€ĸ Significant computational resources
â€ĸ Language-specific supervision

Tamil currently lacks large-scale publicly available coreference corpora.

Instead of waiting for benchmark datasets, we explored a different direction:

𝗕𝘂đ—ļ𝗹𝗱 𝗮 𝗱𝗲𝘁𝗲đ—ŋđ—ēđ—ļđ—ģđ—ļ𝘀𝘁đ—ļ𝗰, 𝗲𝘅đ—Ŋ𝗹𝗮đ—ļđ—ģđ—Žđ—¯đ—šđ—˛, 𝗮đ—ģ𝗱 𝘁𝗮𝘀𝗸-đ—ŧđ—ŋđ—ļ𝗲đ—ģ𝘁𝗲𝗱 𝗰đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 đ—ŗđ—ŋ𝗮đ—ē𝗲𝘄đ—ŧđ—ŋ𝗸 đ—ŧđ—Ŋ𝘁đ—ļđ—ēđ—ļ𝘇𝗲𝗱 đ—ŗđ—ŧđ—ŋ 𝗜đ—ģđ—ŗđ—ŧđ—ŋđ—ē𝗮𝘁đ—ļđ—ŧđ—ģ 𝗘𝘅𝘁đ—ŋ𝗮𝗰𝘁đ—ļđ—ŧđ—ģ.

Our goal was not to compete with neural benchmarks.

Our goal was to improve relation extraction quality in real-world Tamil document processing.

đ—Ļ𝘆𝘀𝘁𝗲đ—ē 𝗔đ—ŋ𝗰đ—ĩđ—ļ𝘁𝗲𝗰𝘁𝘂đ—ŋ𝗲

Tamil Document
↓
Text Normalization
↓
Sentence Segmentation
↓
Mention Detection
↓
Entity Normalization
↓
Entity Memory
↓
Coreference Resolution
↓
Coreference Chain Construction
↓
Visualization Layer

Each layer contributes toward discourse-level entity understanding.

𝗠𝗲đ—ģ𝘁đ—ļđ—ŧđ—ģ 𝗗𝗲𝘁𝗲𝗰𝘁đ—ļđ—ŧđ—ģ

The mention detection layer combines multiple strategies:

â€ĸ Named Entity Recognition (PERSON, LOCATION, ORGANIZATION)
â€ĸ Pronoun Detection
â€ĸ Location Reference Detection
â€ĸ Rule-Based Noun Phrase Detection

These mentions become candidates for resolution.

𝗘đ—ģ𝘁đ—ļ𝘁𝘆 𝗡đ—ŧđ—ŋđ—ē𝗮𝗹đ—ļ𝘇𝗮𝘁đ—ļđ—ŧđ—ģ

Tamil's rich morphology creates multiple surface forms for the same entity.

𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲:

┌────────────────┐
āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯

āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯

āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽŠā¯

āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽąā¯āŽ•ā¯
└────────────────┘

Using Stanza lemmatization:

┌─────────────────┐
āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯
↓
āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯
└─────────────────┘

This reduces entity fragmentation and improves linking consistency.
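
To make the many-to-one mapping concrete, here is a minimal sketch that collapses the four surface forms above. It does not call Stanza: the CASE_SUFFIXES table and the normalize_entity function are hypothetical stand-ins that cover only the forms shown in this example, where a real pipeline would use a trained lemmatizer.

```python
# Toy stand-in for a lemmatizer: rewrite a few Tamil case endings to the
# nominative ending. The rule table is hypothetical and covers only the
# example forms; it is not how Stanza works internally.
CASE_SUFFIXES = [
    ("āŽ¤ā¯āŽ¤āŽŋāŽąā¯āŽ•ā¯", "āŽŽā¯"),  # dative ending
    ("āŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯", "āŽŽā¯"),    # locative ending
    ("āŽ¤ā¯āŽ¤āŽŋāŽŠā¯", "āŽŽā¯"),    # genitive ending
]


def normalize_entity(surface: str) -> str:
    """Return a canonical form by rewriting the first matching suffix."""
    for suffix, replacement in CASE_SUFFIXES:
        if surface.endswith(suffix):
            return surface[: -len(suffix)] + replacement
    return surface  # already canonical, or not covered by the toy rules


forms = ["āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽŠā¯", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽąā¯āŽ•ā¯"]
canonical = {normalize_entity(form) for form in forms}
print(canonical)  # a single canonical form for all four variants
```

A hand-written table like this breaks quickly on unseen suffix chains, which is one reason to prefer a lemmatizer over ad hoc rules.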

𝗧đ—ĩ𝗲 𝗘đ—ģ𝘁đ—ļ𝘁𝘆 𝗠𝗲đ—ēđ—ŧđ—ŋ𝘆 𝗟𝗮𝘆𝗲đ—ŋ

One of the key design decisions was introducing a lightweight discourse memory.

Instead of neural antecedent scoring, the system maintains contextual entity state:

last_person
last_location
last_org

Whenever a new entity is detected, the corresponding memory state is updated.

This memory acts as the document's discourse context.
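
The memory described above can be sketched as a small state object. This is an illustrative sketch, not the CTNLPR implementation: the EntityMemory class, its update and resolve methods, and the two pronoun sets are assumptions. Only the three slot names come from the post.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical pronoun lexicons, limited to pronouns used in this post.
PERSON_PRONOUNS = {"āŽ…āŽĩāŽ°ā¯", "āŽ‡āŽĩāŽ°ā¯"}
LOCATION_PRONOUNS = {"āŽ…āŽ™ā¯āŽ•ā¯"}


@dataclass
class EntityMemory:
    """Holds the most recent canonical entity seen for each type."""
    last_person: Optional[str] = None
    last_location: Optional[str] = None
    last_org: Optional[str] = None

    def update(self, entity_type: str, canonical: str) -> None:
        # Overwrite the slot for this type with the newest entity.
        if entity_type == "PERSON":
            self.last_person = canonical
        elif entity_type == "LOCATION":
            self.last_location = canonical
        elif entity_type == "ORGANIZATION":
            self.last_org = canonical

    def resolve(self, mention: str) -> Optional[str]:
        # Link a pronoun to the latest entity of the matching type.
        if mention in PERSON_PRONOUNS:
            return self.last_person
        if mention in LOCATION_PRONOUNS:
            return self.last_location
        return None  # not a pronoun this sketch knows about


PERSON, PLACE = "āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯"
memory = EntityMemory()
memory.update("PERSON", PERSON)
memory.update("LOCATION", PLACE)
resolved_person = memory.resolve("āŽ…āŽĩāŽ°ā¯")
resolved_place = memory.resolve("āŽ…āŽ™ā¯āŽ•ā¯")
```

Because each type has a single slot, a document that alternates between two people would resolve every pronoun to whichever person was mentioned last.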

𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 đ—Ĩ𝗲𝘀đ—ŧ𝗹𝘂𝘁đ—ļđ—ŧđ—ģ

Once discourse memory has been established, the resolver links newly encountered mentions to previously observed canonical entities.

The system performs:

â€ĸ Person Pronoun Resolution
â€ĸ Possessive Resolution
â€ĸ Location Resolution
â€ĸ Rule-Based Noun Phrase Resolution

By maintaining discourse state across sentence boundaries, fragmented references are transformed into consistent entity representations.

This significantly improves downstream relation extraction quality.

𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 𝗖đ—ĩ𝗮đ—ļđ—ģ 𝗖đ—ŧđ—ģ𝘀𝘁đ—ŋ𝘂𝗰𝘁đ—ļđ—ŧđ—ģ

Rather than replacing mentions individually, the system groups related mentions into entity clusters.

Consider this example:

āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ āŽ’āŽ°ā¯ āŽĒāŽŋāŽ°āŽĒāŽ˛ āŽšāŽŽā¯‚āŽ• āŽšā¯‡āŽĩāŽ•āŽ°ā¯. āŽ…āŽĩāŽ°ā¯ āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯ āŽĒāŽ˛ āŽ•āŽ˛ā¯āŽĩāŽŋāŽ¤ā¯ āŽ¤āŽŋāŽŸā¯āŽŸāŽ™ā¯āŽ•āŽŗā¯ˆ āŽŽā¯āŽŠā¯āŽŠā¯†āŽŸā¯āŽ¤ā¯āŽ¤āŽžāŽ°ā¯. āŽ‡āŽ¨ā¯āŽ¤ āŽšāŽŽā¯‚āŽ•āŽšā¯‡āŽĩāŽ•āŽ°ā¯ āŽĒāŽ˛ āŽĩāŽŋāŽ°ā¯āŽ¤ā¯āŽ•āŽŗā¯ˆ āŽĒā¯†āŽąā¯āŽąā¯āŽŗā¯āŽŗāŽžāŽ°ā¯. āŽ…āŽĩāŽ°ā¯ āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯ āŽĒāŽŋāŽąāŽ¨ā¯āŽ¤āŽžāŽ°ā¯. āŽ…āŽ™ā¯āŽ•ā¯ āŽ…āŽĩāŽ°ā¯āŽ•ā¯āŽ•ā¯ āŽĒā¯†āŽ°ā¯āŽŽā¯ āŽŽāŽ¤āŽŋāŽĒā¯āŽĒ❁ āŽ‡āŽ°ā¯āŽ¨ā¯āŽ¤āŽ¤ā¯. āŽ‡āŽĩāŽ°ā¯ āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖ āŽ¨ā¯‚āŽ˛āŽ•āŽ¤ā¯āŽ¤āŽŋāŽŠā¯ āŽ‰āŽ°ā¯āŽĩāŽžāŽ•ā¯āŽ•āŽ¤ā¯āŽ¤āŽŋāŽ˛ā¯ āŽŽā¯āŽ•ā¯āŽ•āŽŋāŽ¯ āŽĒāŽ™ā¯āŽ•āŽžāŽąā¯āŽąāŽŋāŽŠāŽžāŽ°ā¯.

┌─────────────────────┐
Entity: āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ

├── āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ
├── āŽ…āŽĩāŽ°ā¯
├── āŽ‡āŽ¨ā¯āŽ¤ āŽšāŽŽā¯‚āŽ•āŽšā¯‡āŽĩāŽ•āŽ°ā¯
├── āŽ…āŽĩāŽ°ā¯
└── āŽ‡āŽĩāŽ°ā¯

Entity: āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯

├── āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯
└── āŽ…āŽ™ā¯āŽ•ā¯
└─────────────────────┘

These chains provide a document-level view of entity references.
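
Chain construction can be sketched as a grouping step over already-resolved mentions. The build_chains function and the (surface, canonical) pair format are assumptions made for illustration. The mention list mirrors the chains shown above.

```python
from collections import defaultdict


def build_chains(resolved_mentions):
    """Group (surface mention, canonical entity) pairs into chains."""
    chains = defaultdict(list)
    for surface, canonical in resolved_mentions:
        chains[canonical].append(surface)  # mentions stay in document order
    return dict(chains)


PERSON, PLACE = "āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯"
resolved = [
    (PERSON, PERSON),
    ("āŽ…āŽĩāŽ°ā¯", PERSON),
    (PLACE, PLACE),
    ("āŽ‡āŽ¨ā¯āŽ¤ āŽšāŽŽā¯‚āŽ•āŽšā¯‡āŽĩāŽ•āŽ°ā¯", PERSON),
    ("āŽ…āŽĩāŽ°ā¯", PERSON),
    ("āŽ…āŽ™ā¯āŽ•ā¯", PLACE),
    ("āŽ‡āŽĩāŽ°ā¯", PERSON),
]
chains = build_chains(resolved)
for entity, mentions in chains.items():
    print(f"Entity: {entity}")
    for mention in mentions:
        print(f"  - {mention}")
```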

Useful for:

â€ĸ Debugging
â€ĸ Evaluation
â€ĸ Knowledge Graph Construction
â€ĸ Relation Extraction

𝗩đ—ļ𝘀𝘂𝗮𝗹đ—ļ𝘇𝗮𝘁đ—ļđ—ŧđ—ģ & 𝗔đ—ģ𝗮𝗹𝘆𝘀đ—ļ𝘀

To support experimentation and validation, we developed a Streamlit-based visualization layer.

Users can:

â€ĸ Submit Tamil documents
â€ĸ Inspect generated coreference chains
â€ĸ Analyze entity clusters
â€ĸ Validate resolution decisions

This provides transparency into the resolution process and helps identify weaknesses in rule design.

𝗞𝗲𝘆 𝗜đ—ģ𝘀đ—ļ𝗴đ—ĩ𝘁

𝗖đ—ŧđ—ŋđ—˛đ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 đ—Ĩ𝗲𝘀đ—ŧ𝗹𝘂𝘁đ—ļđ—ŧđ—ģ đ—ļ𝘀 đ—ģđ—ŧ𝘁 đ—ē𝗲đ—ŋ𝗲𝗹𝘆 𝗮 đ—Ŋđ—ŋđ—ŧđ—ģđ—ŧ𝘂đ—ģ-đ—ŋ𝗲𝘀đ—ŧ𝗹𝘂𝘁đ—ļđ—ŧđ—ģ 𝘁𝗮𝘀𝗸.

It is an entity consistency layer that connects:

â€ĸ Named Entity Recognition
â€ĸ Relation Extraction
â€ĸ Knowledge Graph Construction
â€ĸ Semantic Search
â€ĸ RAG Systems

Without coreference:

┌─────────────────┐
(āŽ‡āŽĩāŽ°ā¯, āŽĒāŽ™ā¯āŽ•āŽžāŽąā¯āŽąāŽŋāŽŠāŽžāŽ°ā¯, āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖ āŽ¨ā¯‚āŽ˛āŽ•āŽŽā¯)
└─────────────────┘

With coreference:

┌───────────────────────┐
(āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ, āŽĒāŽ™ā¯āŽ•āŽžāŽąā¯āŽąāŽŋāŽŠāŽžāŽ°ā¯
, āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖ āŽ¨ā¯‚āŽ˛āŽ•āŽŽā¯)
└───────────────────────┘

The second representation is immediately usable within structured knowledge systems.

đ—Ļ𝘆𝘀𝘁𝗲đ—ē 𝗜đ—ēđ—Ŋ𝗹𝗲đ—ē𝗲đ—ģ𝘁𝗮𝘁đ—ļđ—ŧđ—ģ 𝗮𝘁 đ—–đ—§đ—Ąđ—Ÿđ—Ŗđ—Ĩ

Current capabilities include:

✅ Named Entity Recognition
✅ Entity Normalization
✅ Pronoun Resolution
✅ Possessive Resolution
✅ Location Resolution
✅ Rule-Based Noun Phrase Resolution
✅ Coreference Chain Construction
✅ Streamlit-Based Visualization

The system acts as a foundational layer between entity extraction and knowledge graph generation.

𝗙𝘂𝘁𝘂đ—ŋ𝗲 đ—Ēđ—ŧđ—ŋ𝗸

â€ĸ Multi-entity discourse memory
â€ĸ Entity salience tracking
â€ĸ Advanced noun phrase resolution
â€ĸ Relation-aware coreference resolution
â€ĸ Knowledge graph integration
â€ĸ Hybrid neural-rule architectures

𝗖đ—ŧđ—ģ𝗰𝗹𝘂𝘀đ—ļđ—ŧđ—ģ

Building effective Tamil Information Extraction systems requires more than Named Entity Recognition.

By introducing a dedicated coreference resolution layer, we can maintain entity consistency across documents, improve relation extraction quality, and generate more reliable structured knowledge.

For low-resource languages such as Tamil, carefully designed rule-based systems remain a practical and effective pathway toward document-level semantic understanding while larger neural approaches continue to mature.

03/06/2026

âšĄī¸đ—™đ—ļđ—ģ𝗲-𝗧𝘂đ—ģđ—ļđ—ģ𝗴 𝗜đ—ģ𝗱đ—ļ𝗰𝗡𝗘đ—Ĩ đ—ŗđ—ŧđ—ŋ đ—Ļđ—ŋđ—ļ 𝗟𝗮đ—ģ𝗸𝗮đ—ģ 𝗧𝗮đ—ēđ—ļ𝗹 𝗡𝗮đ—ē𝗲𝗱 𝗘đ—ģ𝘁đ—ļ𝘁𝘆 đ—Ĩ𝗲𝗰đ—ŧ𝗴đ—ģđ—ļ𝘁đ—ļđ—ŧđ—ģ

Transformer-based multilingual NLP systems have significantly improved Named Entity Recognition (NER) across many languages. However, low-resource language variants such as Sri Lankan Tamil still face substantial challenges due to limited domain-specific datasets and linguistic underrepresentation.

At CTNLPR, we fine-tuned 𝗮đ—ļđŸ°đ—¯đ—ĩ𝗮đ—ŋ𝗮𝘁/𝗜đ—ģ𝗱đ—ļ𝗰𝗡𝗘đ—Ĩ specifically for Sri Lankan Tamil using a custom annotated NER corpus.

đ—ĸđ—¯đ—ˇđ—˛đ—°đ˜đ—ļ𝘃𝗲

Improve entity recognition for:

â€ĸ Sri Lankan Tamil linguistic patterns
â€ĸ Local person, location, and organization names
â€ĸ Morphology-aware contextual variations

đ—Ēđ—ĩ𝘆 đ—Ļđ—ŋđ—ļ 𝗟𝗮đ—ģ𝗸𝗮đ—ģ 𝗧𝗮đ—ēđ—ļ𝗹 𝗡𝗘đ—Ĩ đ—ļ𝘀 𝗖đ—ĩ𝗮𝗹𝗹𝗲đ—ģ𝗴đ—ļđ—ģ𝗴

Most multilingual NER systems are trained primarily on:

â€ĸ General web corpora
â€ĸ Indian Tamil datasets
â€ĸ Multilingual benchmark datasets
â€ĸ Formal textual sources

When applied to Sri Lankan Tamil, they often struggle with:

â€ĸ Regional naming conventions
â€ĸ Local organization terminology
â€ĸ Morphological suffix complexity
â€ĸ OCR-induced token inconsistencies
â€ĸ Subword tokenization fragmentation
â€ĸ Ambiguous entity boundaries

These limitations directly affect downstream systems such as:

â€ĸ Semantic Search
â€ĸ Document Intelligence
â€ĸ Knowledge Graph Construction
â€ĸ Tamil Chatbots
â€ĸ RAG Systems
â€ĸ Government Document Processing

𝗠đ—ŧ𝗱𝗲𝗹 𝗙đ—ļđ—ģ𝗲-𝗧𝘂đ—ģđ—ļđ—ģ𝗴 đ—ĸ𝘃𝗲đ—ŋ𝘃đ—ļ𝗲𝘄

𝗕𝗮𝘀𝗲 𝗠đ—ŧ𝗱𝗲𝗹

→ ai4bharat/IndicNER

𝗘đ—ģ𝘁đ—ļ𝘁𝘆 𝗧𝘆đ—Ŋ𝗲𝘀

â€ĸ PERSON
â€ĸ LOCATION
â€ĸ ORGANIZATION

𝗞𝗲𝘆 đ—ĸđ—Ŋ𝘁đ—ļđ—ēđ—ļ𝘇𝗮𝘁đ—ļđ—ŧđ—ģ𝘀

✅ Tamil-safe Tokenization
✅ Unicode Normalization
✅ BIO Tagging
✅ Proper Subword Label Alignment
✅ Morphology-aware Training
✅ OCR-aware Preprocessing

𝗧𝗲𝗰đ—ĩđ—ģđ—ļ𝗰𝗮𝗹 𝗖đ—ĩ𝗮𝗹𝗹𝗲đ—ģ𝗴𝗲𝘀

1ī¸âƒŖ 𝗧𝗮đ—ēđ—ļ𝗹 𝗧đ—ŧ𝗸𝗲đ—ģđ—ļ𝘇𝗮𝘁đ—ļđ—ŧđ—ģ

Tamil is morphologically rich. Incorrect tokenization can cause:

â€ĸ Broken entity spans
â€ĸ Incorrect BIO labels
â€ĸ Fragmented predictions

2ī¸âƒŖ đ—Ļđ˜‚đ—¯đ˜„đ—ŧđ—ŋ𝗱 đ—Ÿđ—Žđ—¯đ—˛đ—š 𝗔𝗹đ—ļ𝗴đ—ģđ—ē𝗲đ—ģ𝘁

Transformer tokenizers frequently split Tamil words into multiple subword units.

Without proper alignment:

â€ĸ Entity spans become corrupted
â€ĸ BIO labels mismatch
â€ĸ Training instability increases
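
One common alignment strategy, sketched below, keeps the word label on the first subword and masks the remaining pieces so the loss ignores them. The post does not say which strategy was used, so treat this as an assumption. The word_ids list mimics what word_ids() returns on Hugging Face fast tokenizers, and the specific split is made up for illustration.

```python
IGNORE_INDEX = -100  # value ignored by PyTorch's cross-entropy loss by default


def align_labels(word_labels, word_ids):
    """Expand word-level label ids to subword level.

    word_ids[i] is the index of the word that subword i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # continuation piece is masked
        previous = word_id
    return aligned


label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
word_labels = [label2id["B-PER"], label2id["O"]]
# Hypothetical split: word 0 -> 3 pieces, word 1 -> 2 pieces, plus 2 specials.
word_ids = [None, 0, 0, 0, 1, 1, None]
aligned = align_labels(word_labels, word_ids)
print(aligned)
```

Masking continuation pieces keeps one prediction per word, so entity spans can be read back at word level after inference.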

3ī¸âƒŖ đ—ĸ𝗖đ—Ĩ 𝗡đ—ŧđ—ļ𝘀𝗲

Tamil OCR systems still generate:

â€ĸ Grapheme inconsistencies
â€ĸ Merged tokens
â€ĸ Invalid Unicode combinations
â€ĸ Punctuation corruption

Therefore, OCR-aware normalization was integrated before training.

𝗠đ—ŧ𝗱𝗲𝗹 𝗘𝘃𝗮𝗹𝘂𝗮𝘁đ—ļđ—ŧđ—ģ

đ—ĸ𝘃𝗲đ—ŋ𝗮𝗹𝗹 đ—Ŗđ—˛đ—ŋđ—ŗđ—ŧđ—ŋđ—ē𝗮đ—ģ𝗰𝗲

â€ĸ F1 Score → 0.650
â€ĸ Precision → 0.602
â€ĸ Recall → 0.707
â€ĸ Accuracy → 96.04%

𝗘đ—ģ𝘁đ—ļ𝘁𝘆-𝘄đ—ļ𝘀𝗲 𝗙𝟭

â€ĸ PERSON → 0.721
â€ĸ LOCATION → 0.698
â€ĸ ORGANIZATION → 0.484

PERSON and LOCATION categories achieved relatively strong performance, while ORGANIZATION entities remain the most challenging category.
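
As a quick consistency check, F1 is the harmonic mean of precision and recall, and the reported precision and recall reproduce the reported overall F1. The helper below is only a worked calculation on the published numbers.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_score(0.602, 0.707)
print(round(overall_f1, 3))
```

The gap between 96.04% accuracy and an F1 of 0.650 is typical when accuracy is computed per token, since most tokens carry the O tag. The post does not state how accuracy was computed, so this reading is an assumption.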

𝗜đ—ģđ—ŗđ—˛đ—ŋ𝗲đ—ģ𝗰𝗲 𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲𝘀

𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲 𝟭

Sentence:
"āŽĒāŽžāŽ°āŽ¤āŽŋāŽ¤āŽžāŽšāŽŠā¯ āŽŽāŽ´ā¯āŽ¤āŽŋāŽ¯ āŽ¨ā¯‚āŽ˛ā¯ˆ āŽĒāŽžāŽ°āŽ¤āŽŋ āŽĒāŽ¤āŽŋāŽĒā¯āŽĒāŽ•āŽŽā¯ āŽĩā¯†āŽŗāŽŋāŽ¯āŽŋāŽŸā¯āŽŸāŽ¤ā¯."

Output:
👤 PERSON → āŽĒāŽžāŽ°āŽ¤āŽŋāŽ¤āŽžāŽšāŽŠā¯
đŸĸ ORGANIZATION → āŽĒāŽžāŽ°āŽ¤āŽŋ āŽĒāŽ¤āŽŋāŽĒā¯āŽĒāŽ•āŽŽā¯

𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲 𝟮

Sentence:
"āŽĩāŽŸāŽŽāŽ°āŽžāŽŸā¯āŽšāŽŋ āŽ¤ā¯ŠāŽ´āŽŋāŽ˛ā¯āŽ¨ā¯āŽŸā¯āŽĒ āŽ¨āŽŋāŽąā¯āŽĩāŽŠāŽŽā¯ āŽŽāŽžāŽŖāŽĩāŽ°ā¯āŽ•āŽŗā¯ˆ āŽšā¯‡āŽ°ā¯āŽ¤ā¯āŽ¤āŽ¤ā¯."

Output:
đŸĸ ORGANIZATION → āŽĩāŽŸāŽŽāŽ°āŽžāŽŸā¯āŽšāŽŋ āŽ¤ā¯ŠāŽ´āŽŋāŽ˛ā¯āŽ¨ā¯āŽŸā¯āŽĒ āŽ¨āŽŋāŽąā¯āŽĩāŽŠāŽŽā¯

𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲 đŸ¯

Sentence:
"āŽ¨āŽĩāŽŽāŽŖāŽŋ āŽ•āŽŋāŽ°āŽžāŽŽāŽŽā¯ āŽĩā¯†āŽŗā¯āŽŗāŽ¤ā¯āŽ¤āŽžāŽ˛ā¯ āŽĒāŽžāŽ¤āŽŋāŽ•ā¯āŽ•āŽĒā¯āŽĒāŽŸā¯āŽŸāŽ¤ā¯."

Output:
📍 LOCATION → āŽ¨āŽĩāŽŽāŽŖāŽŋ

𝗘𝘅𝗮đ—ēđ—Ŋ𝗹𝗲 𝟰

Sentence:
"āŽœā¯‡.āŽ.āŽŽāŽ¸ā¯.āŽĒ❀. āŽœāŽ¯āŽšāŽŋāŽ™ā¯āŽ• āŽ¨āŽĩāŽŽāŽŖāŽŋ āŽ•āŽŋāŽ°āŽžāŽŽāŽ¤ā¯āŽ¤āŽŋāŽąā¯āŽ•ā¯ āŽšā¯†āŽŠā¯āŽąāŽžāŽ°ā¯."

Output:
👤 PERSON → āŽœā¯‡.āŽ.āŽŽāŽ¸ā¯.āŽĒ❀. āŽœāŽ¯āŽšāŽŋāŽ™ā¯āŽ•
📍 LOCATION → āŽ¨āŽĩāŽŽāŽŖāŽŋ

𝗞𝗲𝘆 đ—ĸđ—¯đ˜€đ—˛đ—ŋ𝘃𝗮𝘁đ—ļđ—ŧđ—ģ

One of the most important findings from this work is:

"Better preprocessing and domain-specific data can be as important as model architecture."

For low-resource languages like Sri Lankan Tamil:

â€ĸ High-quality annotations matter
â€ĸ OCR normalization matters
â€ĸ Tokenizer alignment matters
â€ĸ Linguistic preprocessing matters

Large transformer architectures alone are not sufficient without carefully prepared language-specific datasets.

𝗔đ—Ŋđ—Ŋ𝗹đ—ļ𝗰𝗮𝘁đ—ļđ—ŧđ—ģ𝘀

â€ĸ Tamil NER Systems
â€ĸ Semantic Search
â€ĸ RAG Pipelines
â€ĸ OCR Information Extraction
â€ĸ Knowledge Graph Construction
â€ĸ Tamil Chatbots

This work is part of ongoing Tamil NLP research at CTNLPR aimed at building stronger NLP infrastructure for low-resource Tamil language technologies.

25/05/2026

🌟𝐁𝐮đĸđĨ𝐝đĸ𝐧𝐠 𝐚 𝐒đĢđĸ 𝐋𝐚𝐧𝐤𝐚𝐧 𝐓𝐚đĻđĸđĨ 𝐍𝐚đĻ𝐞𝐝 𝐄𝐧𝐭đĸ𝐭𝐲 𝐑𝐞𝐜𝐨𝐠𝐧đĸ𝐭đĸ𝐨𝐧 𝐃𝐚𝐭𝐚đŦ𝐞𝐭 𝐟𝐨đĢ 𝐋𝐨𝐰-𝐑𝐞đŦ𝐨𝐮đĢ𝐜𝐞 𝐍𝐋𝐏

The growth of Large Language Models (LLMs) and multilingual NLP systems has significantly improved language technologies across major global languages. However, low-resource languages such as Sri Lankan Tamil still face a severe lack of high-quality annotated datasets—especially for foundational tasks like Named Entity Recognition (NER).

To address this gap, we developed the Srilankan-Tamil-NER Dataset, a Tamil NER dataset designed specifically for Sri Lankan Tamil linguistic and contextual usage.

This dataset is intended to support:

â€ĸ Tamil NER research
â€ĸ Indic language fine-tuning
â€ĸ Information extraction systems
â€ĸ Retrieval-Augmented Generation (RAG)
â€ĸ Tamil LLM adaptation
â€ĸ Domain-specific AI systems for Sri Lanka

đ—Ēđ—ĩ𝘆 đ—Ļđ—ŋđ—ļ 𝗟𝗮đ—ģ𝗸𝗮đ—ģ 𝗧𝗮đ—ēđ—ļ𝗹 𝗡𝗘đ—Ĩ 𝗠𝗮𝘁𝘁𝗲đ—ŋ𝘀

Named Entity Recognition (NER) is a core NLP task that identifies and classifies entities such as:

â€ĸ Person names
â€ĸ Locations
â€ĸ Organizations
â€ĸ Dates
â€ĸ Miscellaneous entities

NER acts as a foundational layer for many downstream NLP systems including:

â€ĸ Question answering
â€ĸ Search systems
â€ĸ Chatbots
â€ĸ Document intelligence
â€ĸ Machine translation
â€ĸ Knowledge graph generation

For Tamil — particularly Sri Lankan Tamil — publicly available annotated corpora remain extremely limited. Existing multilingual datasets often underrepresent regional linguistic variations, local named entities, and culturally contextual terminology.

Most existing NER systems for Tamil are trained on datasets originating from Indian Tamil corpora, leaving significant gaps in handling:

â€ĸ Sri Lankan Tamil vocabulary
â€ĸ Local organization names
â€ĸ Sri Lankan place names
â€ĸ Government and institutional terminology

Our dataset aims to bridge this gap.

đ—”đ—¯đ—ŧ𝘂𝘁 𝘁đ—ĩ𝗲 𝗗𝗮𝘁𝗮𝘀𝗲𝘁

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 𝗡𝗮đ—ē𝗲:
Srilankan-Tamil-NER Dataset

The primary goal of this dataset is to create a high-quality manually curated Named Entity Recognition corpus for Sri Lankan Tamil under CTNLPR.

The dataset is structured to support fine-tuning transformer-based multilingual models such as:

â€ĸ IndicNER
â€ĸ mBERT
â€ĸ XLM-RoBERTa
â€ĸ MuRIL
â€ĸ IndicBERT

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 đ—Ļ𝘁𝗮𝘁đ—ļ𝘀𝘁đ—ļ𝗰𝘀

â€ĸ B-PER (Person): 4,533
â€ĸ B-LOC (Location): 8,110
â€ĸ B-ORG (Organization): 3,369
â€ĸ Total Entities: 16,012

𝗗𝗮𝘁𝗮𝘀𝗲𝘁 đ—Ŗđ—ŋ𝗲đ—Ŋ𝗮đ—ŋ𝗮𝘁đ—ļđ—ŧđ—ģ đ—Ŗđ—ļđ—Ŋ𝗲𝗹đ—ļđ—ģ𝗲

Creating a Tamil NER dataset involves significantly more than simple annotation.

The preparation workflow included multiple stages:

1. đ‘Ģ𝒂𝒕𝒂 đ‘Ē𝒐𝒍𝒍𝒆𝒄𝒕𝒊𝒐𝒏

The raw Tamil text corpus was collected from the Noolaham corpus and other relevant publicly available Sri Lankan Tamil textual sources.

Special attention was given to:

â€ĸ Local linguistic relevance
â€ĸ Entity diversity
â€ĸ Sentence quality
â€ĸ Contextual richness

The objective was to capture realistic Sri Lankan Tamil usage patterns rather than synthetic or translated text.

2. đ‘ļđ‘Ē𝑹 𝒂𝒏𝒅 đ‘ģ𝒆𝒙𝒕 đ‘ĩ𝒐𝒓𝒎𝒂𝒍𝒊𝒛𝒂𝒕𝒊𝒐𝒏

Tamil NLP pipelines often begin with scanned or image-based documents.

As part of our broader Tamil document intelligence workflow, OCR-extracted Tamil text underwent:

â€ĸ Unicode normalization
â€ĸ Punctuation cleaning
â€ĸ Whitespace normalization
â€ĸ Invalid character filtering
â€ĸ OCR noise reduction

OCR-related preprocessing becomes extremely important because Tamil script errors can propagate heavily into token classification systems.
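
The cleaning steps listed above can be approximated with the Python standard library alone. This is a minimal sketch under stated assumptions: the normalize_text function, the set of invisible characters to drop, and the repeated-punctuation rule are illustrative choices, not the exact CTNLPR rules.

```python
import re
import unicodedata

# Zero-width and BOM characters that often survive OCR or copy-paste.
# Dropping ZWJ/ZWNJ unconditionally is a simplifying assumption here.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def normalize_text(text: str) -> str:
    text = unicodedata.normalize("NFC", text)     # Unicode normalization
    text = text.translate(INVISIBLE)              # invalid character filtering
    text = "".join(                               # drop control characters
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    text = re.sub(r"([.,!?;:])\1+", r"\1", text)  # collapse repeated punctuation
    text = re.sub(r"[ \t]+", " ", text)           # whitespace normalization
    return text.strip()


# A two-part vowel sign (U+0BC6 + U+0BBE) composes to a single U+0BCA under NFC.
decomposed = "\u0b95\u0bc6\u0bbe"
print(len(decomposed), len(normalize_text(decomposed)))
```

NFC matters for Tamil because visually identical words can be stored as different code point sequences, which would otherwise end up as different tokens.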

3. đ‘ĩ𝒂𝒎𝒆𝒅 đ‘Ŧ𝒏𝒕𝒊𝒕𝒚 𝑨𝒏𝒏𝒐𝒕𝒂𝒕𝒊𝒐𝒏

The dataset was manually annotated using BIO tagging format.

Entity Types:

â€ĸ B-PER — Beginning of person entity
â€ĸ I-PER — Inside person entity
â€ĸ B-LOC — Beginning of location entity
â€ĸ I-LOC — Inside location entity
â€ĸ B-ORG — Beginning of organization entity
â€ĸ I-ORG — Inside organization entity
â€ĸ O — Non-entity token

Example:

āŽ‡āŽ°āŽžāŽŽāŽ¨āŽžāŽ¤āŽŠā¯ → B-PER
āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯ → B-LOC
āŽĒāŽ˛ā¯āŽ•āŽ˛ā¯ˆāŽ•ā¯āŽ•āŽ´āŽ•āŽŽā¯ → B-ORG

𝗖đ—ĩ𝗮𝗹𝗹𝗲đ—ģ𝗴𝗲𝘀 đ—ļđ—ģ đ—Ļđ—ŋđ—ļ 𝗟𝗮đ—ģ𝗸𝗮đ—ģ 𝗧𝗮đ—ēđ—ļ𝗹 𝗡𝗘đ—Ĩ

Building a Tamil NER dataset introduced several language-specific challenges.

â€ĸ Morphological complexity
â€ĸ OCR noise
â€ĸ Unicode inconsistencies
â€ĸ Token boundary detection
â€ĸ Subword alignment
â€ĸ Limited benchmark corpora

𝗙đ—ļđ—ģ𝗲-𝗧𝘂đ—ģđ—ļđ—ģ𝗴 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲𝘀

This dataset can support:

â€ĸ Tamil NER
â€ĸ OCR post-processing
â€ĸ Semantic search systems
â€ĸ RAG pipelines
â€ĸ Tamil chatbots
â€ĸ Government document AI
â€ĸ Knowledge graph generation

The Srilankan-Tamil-NER Dataset, developed under CTNLPR, represents an important step toward strengthening the Sri Lankan Tamil NLP ecosystem through high-quality entity annotation and linguistically relevant corpus preparation.

#𝑆𝑟𝑖đŋ𝑎𝑛𝑘𝑎𝑛𝑇𝑎𝑚𝑖𝑙 #𝑇𝑎𝑚𝑖𝑙𝑁𝐸𝑅 #𝑇𝑎𝑚𝑖𝑙𝑁đŋ𝑃 #𝑁𝑎𝑚𝑒𝑑𝐸𝑛𝑡𝑖𝑡đ‘Ļ𝑅𝑒𝑐𝑜𝑔𝑛𝑖𝑡𝑖𝑜𝑛 #đŋ𝑜𝑤𝑅𝑒𝑠𝑜đ‘ĸ𝑟𝑐𝑒𝑁đŋ𝑃 #đŧ𝑛𝑑𝑖𝑐𝑁đŋ𝑃 #𝑀đ‘ĸ𝑅đŧđŋ #𝑚đĩ𝐸𝑅𝑇 #𝑋đŋ𝑀𝑅 #đĩđŧ𝑂𝑡𝑎𝑔𝑔𝑖𝑛𝑔 #đŋđŋ𝑀 #𝑅𝐴đē #𝑆𝑒𝑚𝑎𝑛𝑡𝑖𝑐𝑆𝑒𝑎𝑟𝑐ℎ #𝐾𝑛𝑜𝑤𝑙𝑒𝑑𝑔𝑒đē𝑟𝑎𝑝ℎ𝑠 #𝐴đŧ𝐸𝑛𝑔𝑖𝑛𝑒𝑒𝑟𝑖𝑛𝑔 #𝐷𝑜𝑐đ‘ĸ𝑚𝑒𝑛𝑡đŧ𝑛𝑡𝑒𝑙𝑙𝑖𝑔𝑒𝑛𝑐𝑒 #𝐸𝑛𝑡𝑖𝑡đ‘Ļ𝐸đ‘Ĩ𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 #𝑂đļ𝑅 #𝑇𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑟𝑀𝑜𝑑𝑒𝑙𝑠 #đļ𝑜𝑚𝑝đ‘ĸ𝑡𝑎𝑡𝑖𝑜𝑛𝑎𝑙đŋ𝑖𝑛𝑔đ‘ĸ𝑖𝑠𝑡𝑖𝑐𝑠 #đļ𝑇𝑁đŋ𝑃𝑅

15/05/2026

âšĄī¸ 𝐁𝐮đĸđĨ𝐝đĸ𝐧𝐠 𝐚 𝐓𝐚đĻđĸđĨ 𝐂𝐨đĢ𝐞𝐟𝐞đĢ𝐞𝐧𝐜𝐞 𝐑𝐞đŦ𝐨đĨ𝐮𝐭đĸ𝐨𝐧 𝐒𝐲đŦ𝐭𝐞đĻ: 𝐙𝐞đĢ𝐨-𝐒𝐡𝐨𝐭 𝐂𝐨𝐧𝐭𝐞𝐱𝐭𝐮𝐚đĨ 𝐌𝐨𝐝𝐞đĨđĸ𝐧𝐠 𝐰đĸ𝐭𝐡 𝐌𝐮𝐑𝐈𝐋

Coreference Resolution (CR) is a critical NLP task for identifying whether multiple mentions in a document refer to the same real-world entity. It plays a major role in:

â€ĸ Knowledge Graph Construction
â€ĸ Relation Extraction
â€ĸ Semantic Search
â€ĸ RAG Systems
â€ĸ Conversational AI
â€ĸ Document-Level Understanding

While English NLP already has mature coreference systems and libraries, Tamil remains a highly challenging low-resource language for discourse-level semantic modeling.

At CTNLPR, we explored how modern multilingual coreference architectures can be adapted for Tamil using zero-shot contextual semantic modeling instead of heavily supervised pipelines.

Tamil introduces several difficult linguistic challenges:

â€ĸ Agglutinative morphology
â€ĸ Free word order
â€ĸ Pronoun dropping
â€ĸ Implicit subject references
â€ĸ Rich inflectional structures
â€ĸ Long-distance discourse dependencies
â€ĸ Noun-to-noun semantic references

Additionally:

â€ĸ No dedicated Tamil coreference libraries are publicly available
â€ĸ Large annotated Tamil CR datasets are unavailable
â€ĸ Most multilingual systems remain heavily English-biased

âš™ī¸ Our Architecture

The architecture currently being explored at CTNLPR uses:

â€ĸ MuRIL-based contextual embeddings
â€ĸ Span-based mention detection
â€ĸ Contextual span representations
â€ĸ Cosine similarity-based semantic linking
â€ĸ Agglomerative clustering

Instead of manually defining antecedents, the system automatically generates semantic mention spans from Tamil text and groups semantically related mentions into discourse-level entity chains using contextual similarity.

🧠 Key Technical Direction

Traditional supervised coreference systems depend heavily on:

â€ĸ Large annotated corpora
â€ĸ Expensive training pipelines
â€ĸ Language-specific supervision
â€ĸ Antecedent ranking architectures
â€ĸ High computational cost

For Tamil, these resources are extremely limited.

Our approach avoids heavy annotation dependency while still leveraging multilingual transformer-based semantic understanding learned from Indian-language pretraining.

Each candidate span is encoded using contextual embeddings generated from MuRIL, and span-level semantic representations are constructed using contextual token pooling. The system then performs:

â€ĸ Heuristic span pruning
â€ĸ Semantic similarity computation
â€ĸ Similarity-driven clustering

to generate discourse-level coreference chains.
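
The pooling, similarity, and clustering steps can be sketched in plain Python. The vectors below are made-up stand-ins for MuRIL span embeddings, and the 0.8 threshold, the average-linkage choice, and all function names are assumptions. The post does not specify the linkage or the cut-off.

```python
import math
from itertools import combinations


def mean_pool(token_vectors):
    """Average token vectors into one span vector."""
    return [sum(column) / len(column) for column in zip(*token_vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_mentions(vectors, threshold=0.8):
    """Average-linkage agglomerative clustering over span vectors."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (x, y), best = max(
            ((pair, linkage(clusters[pair[0]], clusters[pair[1]]))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[x].extend(clusters[y])  # merge the closest pair (x < y)
        del clusters[y]
    return clusters


mentions = ["āŽšāŽžāŽŽā¯.āŽ.āŽšāŽĒāŽžāŽĒāŽ¤āŽŋ", "āŽ…āŽĩāŽ°ā¯", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯", "āŽ…āŽ™ā¯āŽ•ā¯"]
toy_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
clusters = cluster_mentions(toy_vectors)
chains = [[mentions[i] for i in sorted(cluster)] for cluster in clusters]
print(chains)
```

The threshold is the main tuning knob. If it is too low, unrelated mentions merge into one chain. If it is too high, pronouns stay as singletons.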

🚀 Key Advantages

â€ĸ Zero-shot inference
â€ĸ Low-resource scalability
â€ĸ Context-aware semantic reasoning
â€ĸ Better adaptation to Tamil morphology
â€ĸ Lightweight unsupervised inference
â€ĸ Reduced annotation dependency

This architecture is being explored at CTNLPR as a foundation for:

â€ĸ Tamil discourse understanding
â€ĸ Entity-aware semantic linking
â€ĸ Knowledge Graph Construction
â€ĸ Ontology-aware NLP
â€ĸ Multilingual semantic reasoning
â€ĸ Advanced RAG systems

đŸ”Ŧ Building document-level semantic understanding for Tamil is one of the next major steps toward scalable low-resource AI systems.

04/05/2026

⚡𝐁𝐮đĸđĨ𝐝đĸ𝐧𝐠 𝐚 𝐌𝐨đĢ𝐩𝐡𝐨đĨ𝐨𝐠𝐲-𝐀𝐰𝐚đĢ𝐞 𝐓𝐚đĻđĸđĨ 𝐍𝐄𝐑 𝐏đĸ𝐩𝐞đĨđĸ𝐧𝐞: 𝐅đĢ𝐨đĻ 𝐓đĢ𝐚𝐧đŦ𝐟𝐨đĢđĻ𝐞đĢ 𝐄𝐱𝐭đĢ𝐚𝐜𝐭đĸ𝐨𝐧 𝐭𝐨 𝐂𝐚𝐧𝐨𝐧đĸ𝐜𝐚đĨ 𝐄𝐧𝐭đĸ𝐭𝐲 𝐑𝐞đŦ𝐨đĨ𝐮𝐭đĸ𝐨𝐧

Named Entity Recognition (NER) is a critical layer in our Tamil NLP stack (search, indexing, knowledge graph construction, and RAG).
However, for Tamil, extracting entities is only half the problem — canonicalizing them is the real challenge.

đŸ’Ĩ What We Built in CTNLPR

We designed a Tamil-aware NER pipeline by extending a transformer-based model with morphological normalization:

â€ĸ ai4bharat/IndicNER → baseline entity extraction
â€ĸ Custom span merging → IOB consolidation
â€ĸ Prefix-based grouping → variant clustering
â€ĸ Morphological normalization layer → canonical entity resolution

Since Indic NER models are not morphology-aware, we explicitly evaluated and integrated normalization strategies.

🔰Method Exploration (What We Tried)

â€ĸ IndicNER (Transformer baseline)
✅ Strong recall across entity types
❎ Produces multiple inflected variants of the same entity

â€ĸ Prefix-based grouping
✅ Fast heuristic clustering
❎ Not linguistically grounded

â€ĸ UoM Thamizhi Morphological Normalizer (University of Moratuwa)
✅ Linguistically motivated rule-based approach
❎ Limited effectiveness on real-world data
❎ Struggled with noisy OCR text, complex suffix chains, and unseen word forms

â€ĸ Tamil Lemmatizer (final approach)
✅ Consistent root-form extraction
✅ Robust across inflected variants
✅ Best empirical performance in our pipeline

đŸ”Ŧ Key Design Decision

Transformer models do not enforce canonical forms.

👉 Surface forms like:
â€ĸ āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆāŽ¯āŽŋāŽ˛ā¯
â€ĸ āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆāŽ¯āŽŋāŽ˛ā¯āŽŽā¯
â€ĸ āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆāŽ¯āŽŋāŽ˛ā¯‡

are extracted as separate entities

👉 After normalization:
â€ĸ āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆ

This enables many-to-one mapping, critical for system consistency.

🧩 Our Setup

Pipeline:

â€ĸ Document → chunking
â€ĸ Transformer inference (IndicNER)
â€ĸ IOB span merging + filtering
â€ĸ Variant aggregation (prefix-based)
â€ĸ Morphological normalization (UoM explored → Lemmatizer selected)
â€ĸ Entity re-indexing

Example:

āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆāŽ¯āŽŋāŽ˛ā¯ → āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆ
āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆāŽ¯āŽŋāŽ˛ā¯āŽŽā¯ → āŽ‡āŽ˛āŽ™ā¯āŽ•ā¯ˆ
āŽ‡āŽ¨ā¯āŽ¤āŽŋāŽ¯āŽžāŽĩāŽŋāŽ˛ā¯ → āŽ‡āŽ¨ā¯āŽ¤āŽŋāŽ¯āŽž

âžĄī¸ System-Level Challenges We Solved

â€ĸ Agglutinative suffix handling
â€ĸ Variant explosion in entity outputs
â€ĸ OCR/noisy input robustness
â€ĸ Canonical entity consistency across documents

📊 What We Observed

â€ĸ Transformer NER → high recall, low canonical consistency
â€ĸ UoM morphological normalizer → linguistically sound but limited robustness
â€ĸ Lemmatizer → best normalization performance in practice

🌟 Final system:
IndicNER + Lemmatization (hybrid architecture)

âœŗī¸ Key Insight

In Tamil NER, the challenge is not detection —
it is morphological normalization.

NER output ≠ final entity

🌟 Canonicalization is essential for:
â€ĸ Indexing
â€ĸ Entity linking
â€ĸ Knowledge graphs
â€ĸ RAG systems

🚀 Outcome

We built a production-ready Tamil NER system that:

â€ĸ Resolves inflected entity variants
â€ĸ Produces stable canonical forms
â€ĸ Improves downstream retrieval and analytics
â€ĸ Scales across multi-document pipelines

đŸ”Ŧ This work is part of ongoing Tamil NLP system development at CTNLPR

24/04/2026

🚀 𝐁𝐮đĸđĨ𝐝đĸ𝐧𝐠 𝐚 𝐊𝐞𝐲𝐰𝐨đĢ𝐝 𝐄𝐱𝐭đĢ𝐚𝐜𝐭đĸ𝐨𝐧 𝐏đĸ𝐩𝐞đĨđĸ𝐧𝐞 𝐟𝐨đĢ 𝐓𝐚đĻđĸđĨ: 𝐅đĢ𝐨đĻ 𝐒𝐭𝐚𝐭đĸđŦ𝐭đĸ𝐜𝐚đĨ 𝐌𝐞𝐭𝐡𝐨𝐝đŦ 𝐭𝐨 𝐄đĻ𝐛𝐞𝐝𝐝đĸ𝐧𝐠-𝐁𝐚đŦ𝐞𝐝 𝐌𝐨𝐝𝐞đĨđŦ

Keyword extraction is a core component in our NLP pipeline (search, indexing, and RAG). In practice, adapting existing methods for Tamil required careful system-level design, not just model selection.

âš™ī¸ What We Built

We implemented a Tamil-aware keyword extraction pipeline by adapting standard NLP libraries:

â€ĸ scikit-learn → TF-IDF (statistical baseline)
â€ĸ Gensim → TextRank (graph-based ranking)
â€ĸ KeyBERT → embedding-based semantic extraction

Since these tools are not natively designed for Tamil, we integrated preprocessing using Indic NLP techniques (tokenization, normalization).

🧠 Method Evaluation (What Worked / What Didn’t)

â€ĸ TF-IDF
✅ Useful for corpus-level keyword distribution
❎ No semantic understanding

â€ĸ TextRank
✅ Works without training
❎ Highly sensitive to tokenization quality

â€ĸ YAKE
✅ Fast, strong baseline for per-document keywords

â€ĸ KeyBERT (final approach)
✅ Captures semantic relevance
✅ Best performance for Tamil when paired with proper embeddings

đŸ”Ŧ Key Design Decision

KeyBERT itself is not language-aware: it depends entirely on the embedding model.

👉 Using default English embeddings → poor Tamil results

👉 Using Tamil/Indic embeddings → strong semantic extraction

🧩 Our Setup

We integrated KeyBERT with Tamil-capable embedding models:

â€ĸ l3cube-pune/tamil-sentence-bert-nli
â€ĸ ai4bharat/indic-bert
â€ĸ paraphrase-multilingual-mpnet-base-v2

Pipeline:

â€ĸ Document → embedding
â€ĸ Candidate n-grams generation
â€ĸ Semantic similarity ranking
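
The three pipeline steps can be sketched without any model by injecting the embedding function. The bag-of-words embedder below is a toy stand-in for a Tamil sentence-embedding model, and all function names are hypothetical. With KeyBERT itself, the equivalent call would be along the lines of extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None, top_n=5). Its stop_words parameter defaults to 'english', so Tamil text needs a custom list or None.

```python
import math


def candidate_ngrams(tokens, max_n=2, stopwords=frozenset()):
    """All distinct 1..max_n-grams that contain no stopword."""
    seen, out = set(), []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            phrase = " ".join(gram)
            if phrase not in seen and not stopwords.intersection(gram):
                seen.add(phrase)
                out.append(phrase)
    return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keywords(text, embed, top_n=5, max_n=2, stopwords=frozenset()):
    doc_vec = embed(text)                                    # document -> embedding
    candidates = candidate_ngrams(text.split(), max_n, stopwords)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


def make_bow_embedder(vocabulary):
    """Toy embedder: word counts over a fixed vocabulary (not a real model)."""
    index = {word: i for i, word in enumerate(vocabulary)}

    def embed(text):
        vec = [0.0] * len(index)
        for word in text.split():
            if word in index:
                vec[index[word]] += 1.0
        return vec
    return embed


tokens = ["āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯", "āŽ¨ā¯‚āŽ˛āŽ•āŽŽā¯", "āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖāŽŽā¯", "āŽ•āŽ˛ā¯āŽĩāŽŋ"]
embed = make_bow_embedder(sorted(set(tokens)))
ranked = rank_keywords(" ".join(tokens), embed, top_n=3)
print(ranked)
```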

🛑 System-Level Challenges We Solved

â€ĸ Tamil stopword handling (custom lists)
â€ĸ Text normalization (spelling variations, diacritics)
â€ĸ Tokenization consistency
â€ĸ Handling low-resource language constraints

📊 What We Observed

â€ĸ TF-IDF → strong for global topic words
â€ĸ YAKE → reliable lightweight baseline
â€ĸ KeyBERT + Tamil embeddings → best semantic keyword quality

💡 Key Insight

In Tamil NLP, keyword extraction is not limited by the algorithm. It is constrained by:

â€ĸ Embedding quality
â€ĸ Tokenization
â€ĸ Text normalization

🚀 Outcome

We built a production-ready Tamil keyword extraction pipeline that:
â€ĸ Produces semantically meaningful keywords
â€ĸ Works across different document types
â€ĸ Integrates seamlessly into downstream RAG systems

đŸ”Ŧ This work is part of ongoing Noolaham GPT development at CTNLPR.

15/04/2026

⚡ 𝐃𝐞đŦđĸ𝐠𝐧đĸ𝐧𝐠 𝐚𝐧 𝐄𝐟𝐟đĸ𝐜đĸ𝐞𝐧𝐭 𝐑𝐞đĢ𝐚𝐧𝐤đĸ𝐧𝐠 𝐋𝐚𝐲𝐞đĢ: 𝐌𝐮đĨ𝐭đĸđĨđĸ𝐧𝐠𝐮𝐚đĨ 𝐂đĢ𝐨đŦđŦ-𝐄𝐧𝐜𝐨𝐝𝐞đĢ 𝐎𝐩𝐭đĸđĻđĸđŗđšđ­đĸ𝐨𝐧 𝐟𝐨đĢ 𝐓𝐚đĻđĸđĨ–𝐄𝐧𝐠đĨđĸđŦ𝐡 𝐑𝐀𝐆

In multilingual RAG systems, dense retrieval can surface relevant chunks, but retrieval alone is not sufficient.
Not all retrieved passages are equally relevant, and passing all candidates directly to the LLM leads to:

â€ĸ Increased token usage
â€ĸ Higher latency
â€ĸ Noisy context → degraded response quality

🔍 Problem

Dense retrievers often fail at ranking precision, especially for mixed-language queries (Tamil, English).

This results in:

â€ĸ Relevant documents ranked lower
â€ĸ Cross-lingual inconsistencies
â€ĸ Reduced downstream LLM answer quality

âš™ī¸ Core Approach

At CTNLPR, we introduce a cross-encoder reranking layer to refine retrieval results.

Unlike bi-encoders, rerankers:

â€ĸ Jointly encode query–document pairs
â€ĸ Capture fine-grained semantic relevance
â€ĸ Improve cross-lingual ranking consistency

👉 This enables accurate ordering of multilingual candidates before generation.

đŸ”Ŧ Model Evaluation

We evaluated multiple multilingual rerankers:

â€ĸ BGE-v2-m3 → high accuracy, higher latency on CPU
â€ĸ jina-v3-multi → strong cross-lingual consistency
â€ĸ jina-v2-cpu-opt → best latency–quality trade-off
â€ĸ gte-multilingual → stable performance

Without reranking, we observed:

â€ĸ Correct documents retrieved but mis-ranked
â€ĸ Ranking instability for mixed-language queries
â€ĸ Noise introduced by lexical fusion methods (e.g., RRF)

🧩 Reranking Pipeline

We adopt a two-stage architecture (dense retrieval followed by cross-encoder reranking):

1. Retrieve Top-K candidates (dense retrieval)
2. Apply cross-encoder reranker
3. Score and reorder candidates
4. Pass Top-N results to LLM
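
The four steps above can be sketched with the pair scorer injected as a function. The token_overlap scorer is a toy stand-in for a cross-encoder, and retrieve_then_rerank and its parameters are hypothetical names chosen for this illustration.

```python
def token_overlap(query, passage):
    """Toy pair scorer (Jaccard overlap); a cross-encoder would go here."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0


def retrieve_then_rerank(query, candidates, score_pair, top_k=20, top_n=5):
    """candidates: list of (passage, retrieval_score) in any order."""
    # Stage 1: keep the Top-K passages by dense-retrieval score.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    # Stage 2: score each (query, passage) pair jointly, then reorder.
    reranked = sorted(
        ((passage, score_pair(query, passage)) for passage, _ in shortlist),
        key=lambda c: c[1],
        reverse=True,
    )
    return reranked[:top_n]  # Top-N passages go to the LLM


candidates = [
    ("jaffna weather today", 0.9),
    ("history of the jaffna library", 0.7),
    ("cricket scores", 0.2),
]
result = retrieve_then_rerank(
    "jaffna library history", candidates, token_overlap, top_k=2, top_n=1
)
print(result)
```

The reranker runs one forward pass per shortlisted passage, so top_k directly bounds the reranking cost.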

⚡ CPU Optimization Strategy

Cross-encoders are computationally expensive, especially in CPU-only environments.
Our objective: maximize ranking quality under strict latency constraints.

1ī¸âƒŖ Candidate Reduction (High Impact)
â€ĸ Reduce Top-K before reranking (e.g., 100 → 20)
â€ĸ Directly minimizes forward passes

💡 Largest performance gain comes from reducing reranker calls

2ī¸âƒŖ ONNX + INT8 Quantization

â€ĸ Convert PyTorch → ONNX
â€ĸ Apply INT8 dynamic quantization

Benefits:
â€ĸ Faster inference
â€ĸ Lower memory usage
â€ĸ Minimal impact on ranking quality

3ī¸âƒŖ Token & Runtime Optimization

â€ĸ Reduce max token length (512 → 256)
â€ĸ Optimize CPU threading (OMP / MKL)
â€ĸ Use efficient tokenization + batching

💡 Self-attention scales as O(n²), making token reduction critical

📊 Performance Signals

â€ĸ Latency reduced from seconds → sub-second range (~100× improvement)
â€ĸ Maintained strong ranking quality (MRR / nDCG)
â€ĸ Stable cross-lingual ranking (Tamil ↔ English)

What Didn’t Work

â€ĸ Similarity threshold filtering → unstable across scripts
â€ĸ RRF (Reciprocal Rank Fusion) → introduces lexical noise

💡 Key Insight

Multilingual RAG is not just a retrieval problem —
it is a ranking precision problem.

â€ĸ Retrieval → ensures coverage
â€ĸ Reranking → ensures correctness

🚀 Outcome

â€ĸ Improved ranking accuracy across languages
â€ĸ Reduced CPU latency to production-ready levels
â€ĸ Efficient, scalable multilingual pipeline
â€ĸ Better handling of mixed-language queries

Multilingual RAG becomes reliable when retrieval and reranking are jointly optimized.

At CTNLPR, we designed and deployed this reranking layer as part of our Tamil–English RAG pipeline, focusing on CPU-efficient cross-lingual ranking for real-world, large-scale document systems.

09/04/2026

âšĄī¸đƒđžđŦđĸ𝐠𝐧đĸ𝐧𝐠 𝐚 𝐁đĸđĨđĸ𝐧𝐠𝐮𝐚đĨ 𝐑𝐀𝐆 𝐒𝐲đŦ𝐭𝐞đĻ: 𝐂đĢ𝐨đŦđŦ-𝐋đĸ𝐧𝐠𝐮𝐚đĨ 𝐃𝐞𝐧đŦ𝐞 𝐑𝐞𝐭đĢđĸđžđ¯đšđĨ 𝐟𝐨đĢ 𝐓𝐚đĻđĸđĨ–𝐄𝐧𝐠đĨđĸđŦ𝐡

In multilingual RAG systems, the key challenge is cross-lingual retrieval — enabling a query in Tamil to retrieve semantically relevant Tamil and English passages from a unified index (and vice versa), without translation pipelines or language-specific partitioning.

âš™ī¸ Core Approach

We rely on multilingual dense encoders that project Tamil and English into a shared semantic vector space, allowing semantically aligned content across languages to be retrieved using standard similarity search.

đŸ”Ŧ Model Evaluation

We evaluated:
â€ĸ Sentence Transformers (SBERT variants)
â€ĸ Indic-specific models (IndicBERT, MuRIL)

Observed limitations:
â€ĸ Weak Tamil–English alignment
â€ĸ Inconsistent cross-lingual similarity distributions
â€ĸ Lower recall in mixed-language retrieval

✅ Selected Model

→ intfloat/multilingual-e5-large

Reasons:
â€ĸ Built on XLM-RoBERTa-large (multilingual pretraining)
â€ĸ Trained with large-scale contrastive objectives (>1B pairs)
â€ĸ Fine-tuned on retrieval benchmarks (MS MARCO, Mr.TyDi, MIRACL)
â€ĸ Prefix-conditioned embedding (inputs carry “query: ” or “passage: ” prefixes)

This results in strong cross-lingual ranking and alignment, especially for low-resource languages.

🧩 Indexing Strategy

We use a unified embedding + single index design:
â€ĸ Chunk all documents (Tamil + English)
â€ĸ Encode using the same model
â€ĸ Store in one vector index
No language-based partitioning.

🔎 Retrieval Flow

1. Encode query (Tamil or English)
2. Perform ANN search (cosine similarity)
3. Retrieve top-k cross-lingual chunks
4. Pass to LLM for response synthesis
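
The retrieval flow can be sketched with brute-force cosine search over a single mixed-language index. The vectors below are made up. In practice they would come from the encoder, for example via the sentence-transformers package with encode(texts, normalize_embeddings=True), and an ANN index would replace the linear scan. The helper names are hypothetical.

```python
import math


def e5_query(text: str) -> str:
    """E5 models expect queries to start with 'query: '."""
    return f"query: {text}"


def e5_passage(text: str) -> str:
    """E5 models expect indexed chunks to start with 'passage: '."""
    return f"passage: {text}"


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, index, k=3):
    """Brute-force cosine search over (chunk_id, vector) pairs."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# One index for both languages; the ids and vectors are illustrative only.
index = [("ta-1", [1.0, 0.0]), ("en-1", [0.9, 0.1]), ("en-2", [0.0, 1.0])]
hits = search([1.0, 0.05], index, k=2)
print(e5_query("āŽ¯āŽžāŽ´ā¯āŽĒā¯āŽĒāŽžāŽŖ āŽ¨ā¯‚āŽ˛āŽ•āŽŽā¯"), hits)
```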

📊 Benchmark Signals (MRR / nDCG)

Across multilingual benchmarks and internal evaluations:
â€ĸ MRR@10 ↑ → better early precision in cross-lingual retrieval
â€ĸ nDCG@10 ↑ → improved ranking quality for mixed-language queries
â€ĸ Recall@10 ↑ → higher retrieval coverage (Tamil ↔ English)
â€ĸ More stable cosine similarity distributions across scripts

These gains are primarily driven by large-scale contrastive training + retrieval-specific fine-tuning.
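
For reference, the three metrics named above can be computed per query as follows and then averaged over the query set (MRR is the mean of the per-query reciprocal rank). This sketch assumes binary relevance. The document ids are made up, and no benchmark numbers are implied.

```python
import math


def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2"]  # system output for one query
relevant = {"d1", "d2"}            # gold relevant documents
rr = reciprocal_rank_at_k(ranked, relevant)
recall = recall_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(rr, recall, round(ndcg, 4))
```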

💡 Key Insight

Cross-lingual RAG is not a database problem; it is an embedding alignment problem solved at training time.

🚀 Outcome

â€ĸ Stronger cross-lingual ranking (Mean Reciprocal Rank/nDCG improvements)
â€ĸ No translation overhead
â€ĸ Single index, reduced system complexity
â€ĸ Better knowledge coverage across languages

Multilingual retrieval becomes reliable when both languages share the same semantic space.

Address


63, Sir Pon, Thirunelvelly, Ramanathan Road, Kallady
Jaffna
40000

Opening Hours

Monday 09:00 - 17:00
Tuesday 09:00 - 17:00
Wednesday 09:00 - 17:00
Thursday 09:00 - 17:00
Friday 09:00 - 17:00