Task: NLP
Release Date: 6/10/2026
Format: TXT
Size: 41.85 MB
This dataset is a comprehensive Tamil text corpus compiled from diverse sources to capture the richness and variability of the Tamil language. It includes a wide range of text types such as news articles, literature, online content, and conversational data, making it suitable for multiple natural language processing (NLP) tasks. The corpus has been cleaned and preprocessed to remove noise while preserving linguistic nuances, ensuring high-quality input for model training and evaluation. It can be used for applications such as language modeling, machine translation, sentiment analysis, and text classification. Additionally, the dataset aims to support research and development for low-resource language technologies and promote inclusivity in AI systems by strengthening resources for Tamil.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes.
Forbidden Usage
Users agree not to attempt to determine the identity of individuals within the text and are strictly prohibited from using this data for commercial purposes or the training of harmful generative models.
Intended Use
This dataset is intended for use in developing and evaluating natural language processing models for Tamil, including tasks such as language modeling, text classification, and machine translation.
Tamil (தமிழ்) is a Dravidian language of the South Dravidian branch and one of the longest-surviving classical languages in the world, with a literary tradition spanning over 2,000 years. It is the official language of the Indian state of Tamil Nadu and the union territory of Puducherry, and holds official status in Sri Lanka and Singapore. Tamil is widely spoken across Malaysia, Mauritius, and among large diaspora communities globally. According to Glottolog, it belongs to the Southern Dravidian group alongside Kannada and Malayalam. Most speakers are bilingual in English or Hindi depending on their region and level of education.
Tamil Script
அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, க, ங, ச, ஞ, ட, ண, த, ந, ப, ம, ய, ர, ல, வ, ழ, ள, ற, ன, ஜ, ஷ, ஸ, ஹ, க்ஷ, ஸ்ரீ, ், ா, ி, ீ, ு, ூ, ெ, ே, ை, ொ, ோ, ௌ, ஂ, ஃ
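As a rough sanity check on the script inventory above, the following minimal sketch measures how much of a text falls inside the Unicode Tamil block (U+0B80 to U+0BFF) and normalizes two-part vowel signs to a single code point. The function names `is_tamil_char`, `tamil_ratio`, and `normalize` are illustrative and not part of the dataset; whether the corpus is already NFC-normalized is not stated in this card.

```python
import unicodedata

# The Unicode Tamil block spans U+0B80 through U+0BFF.
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF


def is_tamil_char(ch: str) -> bool:
    """Return True if the code point lies inside the Unicode Tamil block."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END


def tamil_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil_char(ch) for ch in chars) / len(chars)


def normalize(text: str) -> str:
    """Apply NFC so that split vowel signs (e.g. U+0BC6 + U+0BBE) become one code point."""
    return unicodedata.normalize("NFC", text)
```

A low `tamil_ratio` on a file would suggest mixed-language content or leftover markup, though digits and punctuation also lower the ratio, so any threshold is a judgment call.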
Historical Culture Article Blog: In-depth articles and blog posts covering Tamil historical events, cultural heritage, classical literature, philosophy, and civilizational narratives written by author Santhanam Swaminathan.
Tamil Articles: General-purpose articles spanning a wide range of topics including society, politics, religion, and everyday life written for a broad Tamil-speaking audience.
Tamil Blogs: Informal blog-style writing capturing personal narratives, opinion pieces, cultural commentary, and regional perspectives across multiple years.
The dataset is organized into two primary directories:
Tamil Text Corpus/
│
├── Tamil Text Corpus First/
│   │
│   └── Santhanam Swaminathan/
│       │
│       └── Historical Culture Article Blog/
│
│
└── Tamil Text Corpus Second/
    │
    ├── 2012/
    │   ├── 01-Tamil Text Collection 2012.txt
    │   └── ...
    │
    ├── 2013/
    │   ├── 01-Tamil Text Collection 2013.txt
    │   └── ...
    │
    ├── 2014/
    │   └── ...
    │
    ├── 2015/
    │   └── ...
    │
    ├── 2016/
    │   └── ...
    │
    ├── 2017/
    │   └── ...
    │
    ├── 2018/
    │   └── ...
    │
    ├── 2019/
    │   └── ...
    │
    ├── 2020/
    │   └── ...
    │
    ├── 2021/
    │   └── ...
    │
    ├── 2022/
    │   └── ...
    │
    ├── 2023/
    │   └── ...
    │
    ├── 2024/
    │   └── ...
    │
    ├── 2025/
    │   └── ...
    │
    └── 2026/
        ├── 01-Tamil Text Collection 2026.txt
        └── ...
Tamil Text Corpus First: Contains curated literary content authored by Santhanam Swaminathan, organized under a single domain, Historical Culture Article Blog. This collection focuses on Tamil historical narratives, cultural heritage, and classical literary commentary.
Tamil Text Corpus Second: Contains year-wise Tamil text collections spanning 2012 to 2026, capturing a broad range of articles and blog content across multiple contributors and topics over a 14-year period.
Santhanam Swaminathan — Historical Culture Article Blog (Tamil Text Corpus First)
Tamil Text Corpus — Year-wise Tamil articles and blog collections from 2012 to 2026 (Tamil Text Corpus Second)
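The year-wise layout of the second directory can be traversed with a short script. This is a minimal sketch that assumes the folder names shown in the tree, a local copy of the corpus, and UTF-8 encoded files (the encoding is not stated in this card); `list_yearly_files` and `read_text` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict, List


def list_yearly_files(corpus_root: Path) -> Dict[int, List[Path]]:
    """Map each year folder under 'Tamil Text Corpus Second' to its .txt files."""
    second = corpus_root / "Tamil Text Corpus Second"
    files_by_year: Dict[int, List[Path]] = {}
    if not second.is_dir():
        return files_by_year
    for year_dir in sorted(second.iterdir()):
        # Only folders named as a four-digit year are considered.
        if year_dir.is_dir() and year_dir.name.isdigit() and len(year_dir.name) == 4:
            files_by_year[int(year_dir.name)] = sorted(year_dir.glob("*.txt"))
    return files_by_year


def read_text(path: Path) -> str:
    """Read one file, assuming UTF-8 encoding (an assumption, not stated in the card)."""
    return path.read_text(encoding="utf-8")
```

For example, `list_yearly_files(Path("Tamil Text Corpus"))` would return a dictionary keyed by year, which makes it straightforward to build time-based train and evaluation splits.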
| Field | Details |
|---|---|
| Dataset Name | Tamil Text Corpus |
| Language | Tamil (தமிழ்) |
| Language Family | Dravidian — South Dravidian Branch |
| Script | Tamil Script (Unicode) |
| Number of Authors | (Santhanam Swaminathan) |
| Number of Domains | 3 (Historical Culture Article Blog, Tamil Articles, Tamil Blogs) |
| File Format | Plain Text (.txt) |
| Total Word Count | 4,839,830 words |
| Coverage | 2012 – 2026 |
| Annotation | Unannotated — raw natural text |
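The table reports a total of 4,839,830 words but does not say how words were counted. One plausible reading is whitespace-delimited tokens, sketched below; this is an assumption, and a different tokenizer (for instance one that splits punctuation or subwords) would give a different total. The name `count_words` is illustrative.

```python
from typing import Iterable


def count_words(texts: Iterable[str]) -> int:
    """Count whitespace-delimited words across an iterable of text strings.

    Assumption: a "word" is any maximal run of non-whitespace characters.
    """
    return sum(len(text.split()) for text in texts)
```

Applied to the sample line `தாயுமானவருடன் 60 வினாடி பேட்டி`, this counts four words, since the numeral is treated as a word like any other.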
Format: Plain Text (.txt)
Naming Convention: [##]-Tamil [Domain] Collection.txt (year-wise files also carry the year, as in 01-Tamil Text Collection 2012.txt)
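File names can be split into their parts with a regular expression. The sketch below accepts both the stated pattern, `[##]-Tamil [Domain] Collection.txt`, and the year-suffixed form that appears in the directory tree, `01-Tamil Text Collection 2012.txt`. The function name `parse_file_name` is hypothetical, and treating the year as an optional trailing field is an assumption drawn from those two examples.

```python
import re
from typing import Optional, Tuple

# Two-digit index, a domain label, and an optional four-digit year before ".txt".
FILE_NAME_PATTERN = re.compile(
    r"^(\d{2})-Tamil (.+?) Collection(?: (\d{4}))?\.txt$"
)


def parse_file_name(name: str) -> Optional[Tuple[int, str, Optional[int]]]:
    """Return (index, domain, year) for a matching file name, else None."""
    match = FILE_NAME_PATTERN.match(name)
    if match is None:
        return None
    index, domain, year = match.groups()
    return int(index), domain, int(year) if year is not None else None
```

Names that do not follow either form return `None`, which makes it easy to flag unexpected files before processing.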
தாயுமானவருடன் 60 வினாடி பேட்டி
எல்லோரும் இன்புற்றிருக்க நினைப்பதுவே அல்லாமல் வேறொன்றும் அறியோம் பரபரமே
மண்ணும் மறிகடலும் மற்றுளவும் எல்லாம் உன் கண்ணில் இருக்கவும் நான் கண்டேன் பராபரமே
வேறுபடும் சமயம் எல்லாம் புகுந்து பார்க்கின் விளங்கு பரம் பொருளே! நின் விளையாட்டல்லால்
சைவ சமயமே சமயம் சமயாதீதப் பழம்பொருளைக் கை வந்திடவே மன்றுல் வெளி காட்டும் இந்தக் கருத்தை