Tamil Literature Corpus

Description

This dataset is a Tamil text corpus compiled from multiple sources to capture the richness and variability of the Tamil language. It includes text types such as news articles, literature, online content, and conversational data, which makes it suitable for a range of natural language processing (NLP) tasks, including language modeling, machine translation, sentiment analysis, and text classification. The corpus has been cleaned and preprocessed to remove noise while preserving linguistic nuances. Because the text is unannotated, supervised tasks such as sentiment analysis and text classification require labels to be added separately. The dataset also aims to support research and development for low-resource language technologies and to promote inclusivity in AI systems by strengthening resources for Tamil.

Specifics

Tamil Text Corpus Dataset

Language

Tamil (தமிழ்) is a Dravidian language of the South Dravidian branch and one of the longest-surviving classical languages in the world, with a literary tradition spanning over 2,000 years. It is the official language of the Indian state of Tamil Nadu and the union territory of Puducherry, and it holds official status in Sri Lanka and Singapore. Tamil is also widely spoken in Malaysia and Mauritius and among large diaspora communities worldwide. Glottolog classifies it in the South Dravidian group alongside Kannada and Malayalam. Many speakers are also bilingual, commonly in English or Hindi depending on their region and level of education.

Script

Tamil Script

அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, க, ங, ச, ஞ, ட, ண, த, ந, ப, ம, ய, ர, ல, வ, ழ, ள, ற, ன, ஜ, ஷ, ஸ, ஹ, க்ஷ, ஸ்ரீ, ், ா, ி, ீ, ு, ூ, ெ, ே, ை, ொ, ோ, ௌ, ஂ, ஃ
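The characters listed above all fall inside the Unicode Tamil block (U+0B80 to U+0BFF), so a simple code-point range check can flag files that contain unexpected non-Tamil text. The following is a minimal sketch, not part of the dataset tooling; the function name tamil_ratio is hypothetical.

```python
# The Unicode Tamil block spans U+0B80..U+0BFF (range end is exclusive).
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def tamil_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for ch in chars if ord(ch) in TAMIL_BLOCK)
    return tamil / len(chars)


# Conjuncts such as க்ஷ are sequences of several code points,
# so character counts differ from the number of visible glyphs.
KSSA = "\u0b95\u0bcd\u0bb7"  # க + ் + ஷ
```

Digits, Latin letters, and punctuation count as non-Tamil here, so a ratio below 1.0 is expected for ordinary prose; any threshold for rejecting a file would have to be chosen empirically.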

Domains of the Text

Historical Culture Article Blog: In-depth articles and blog posts by Santhanam Swaminathan covering Tamil historical events, cultural heritage, classical literature, philosophy, and civilizational narratives.
Tamil Articles: General-purpose articles spanning a wide range of topics including society, politics, religion, and everyday life written for a broad Tamil-speaking audience.
Tamil Blogs: Informal blog-style writing capturing personal narratives, opinion pieces, cultural commentary, and regional perspectives across multiple years.

Dataset Structure

The dataset is organized into two primary directories:

Tamil Text Corpus/
│
├── Tamil Text Corpus First/
│   │
│   └── Santhanam Swaminathan/
│       │
│       └── Historical Culture Article Blog/
│          
│
└── Tamil Text Corpus Second/
    │
    ├── 2012/
    │   ├── 01-Tamil Text Collection 2012.txt
    │   └── ...
    │
    ├── 2013/
    │   ├── 01-Tamil Text Collection 2013.txt
    │   └── ...
    │
    ├── 2014/
    │   └── ...
    │
    ├── 2015/
    │   └── ...
    │
    ├── 2016/
    │   └── ...
    │
    ├── 2017/
    │   └── ...
    │
    ├── 2018/
    │   └── ...
    │
    ├── 2019/
    │   └── ...
    │
    ├── 2020/
    │   └── ...
    │
    ├── 2021/
    │   └── ...
    │
    ├── 2022/
    │   └── ...
    │
    ├── 2023/
    │   └── ...
    │
    ├── 2024/
    │   └── ...
    │
    ├── 2025/
    │   └── ...
    │
    └── 2026/
        ├── 01-Tamil Text Collection 2026.txt
        └── ...

Tamil Text Corpus First

Contains curated literary content authored by Santhanam Swaminathan, organized under a single domain — Historical Culture Article Blog. This collection focuses on Tamil historical narratives, cultural heritage, and classical literary commentary.

Tamil Text Corpus Second

Contains year-wise Tamil text collections covering the 15 calendar years from 2012 to 2026, capturing a broad range of articles and blog content across multiple contributors and topics.
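The year-wise layout of Tamil Text Corpus Second can be traversed with the standard library alone. The sketch below assumes the directory names shown in the tree above (four-digit year folders containing .txt files); the function name files_by_year is hypothetical, and the path passed to it depends on where the dataset is unpacked.

```python
from pathlib import Path


def files_by_year(second_root: Path) -> dict[int, list[Path]]:
    """Map each four-digit year directory to its sorted list of .txt files."""
    result: dict[int, list[Path]] = {}
    for year_dir in sorted(second_root.iterdir()):
        # Skip anything that is not a directory named like a year, e.g. "2012".
        if year_dir.is_dir() and year_dir.name.isdigit() and len(year_dir.name) == 4:
            result[int(year_dir.name)] = sorted(year_dir.glob("*.txt"))
    return result
```

A call such as files_by_year(Path("Tamil Text Corpus") / "Tamil Text Corpus Second") would then give one entry per year folder, which is convenient for building time-based train and test splits.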

Authors & Collections

Santhanam Swaminathan — Historical Culture Article Blog (Tamil Text Corpus First)
Tamil Text Corpus — Year-wise Tamil articles and blog collections from 2012 to 2026 (Tamil Text Corpus Second)

Metadata

Field	Details
Dataset Name	Tamil Text Corpus
Language	Tamil (தமிழ்)
Language Family	Dravidian — South Dravidian Branch
Script	Tamil Script (Unicode)
Named Authors	Santhanam Swaminathan
Number of Domains	3 (Historical Culture Article Blog, Tamil Articles, Tamil Blogs)
File Format	Plain Text (.txt)
Total Word Count	4,839,830 words
Coverage	2012 – 2026
Annotation	Unannotated — raw natural text
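The table reports the corpus size in words but does not say how words were counted. As a rough cross-check, the sketch below assumes whitespace-delimited counting over every .txt file under a root directory; the function names are hypothetical, and a different tokenization would give a different total.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return len(text.split())


def corpus_word_count(root: Path) -> int:
    """Sum the word counts of all .txt files found recursively under root."""
    total = 0
    for path in root.rglob("*.txt"):
        total += count_words(path.read_text(encoding="utf-8"))
    return total
```

Under this definition the first line of the sample text, தாயுமானவருடன் 60 வினாடி பேட்டி, counts as 4 words.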

File Format

Format: Plain Text (.txt)
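Naming Convention: [##]-Tamil Text Collection [YYYY].txt for the year-wise files shown in the directory tree (for example, 01-Tamil Text Collection 2012.txt)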
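The file names visible in the directory tree, such as 01-Tamil Text Collection 2012.txt, combine a two-digit index with a four-digit year. A minimal sketch for extracting both parts follows; the pattern is inferred from those examples only, and the names FILENAME_RE and parse_filename are hypothetical. Files in other parts of the corpus may follow a different pattern.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from names like "01-Tamil Text Collection 2012.txt".
FILENAME_RE = re.compile(
    r"^(?P<index>\d{2})-Tamil Text Collection (?P<year>\d{4})\.txt$"
)


def parse_filename(name: str) -> Optional[Tuple[int, int]]:
    """Return (index, year) for a matching file name, or None otherwise."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return int(match.group("index")), int(match.group("year"))
```

For example, parse_filename("01-Tamil Text Collection 2012.txt") returns (1, 2012), while a name that does not fit the pattern returns None rather than raising an error.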

Sample Text

தாயுமானவருடன் 60 வினாடி பேட்டி
எல்லோரும் இன்புற்றிருக்க நினைப்பதுவே அல்லாமல் வேறொன்றும் அறியோம் பரபரமே
மண்ணும் மறிகடலும் மற்றுளவும் எல்லாம் உன் கண்ணில் இருக்கவும் நான் கண்டேன் பராபரமே
வேறுபடும் சமயம் எல்லாம் புகுந்து பார்க்கின் விளங்கு பரம் பொருளே! நின் விளையாட்டல்லால்
சைவ சமயமே சமயம் சமயாதீதப் பழம்பொருளைக் கை வந்திடவே மன்றுல் வெளி காட்டும் இந்தக் கருத்தை
