Task: NLP
Release Date: 6/10/2026
Format: TXT
Size: 19.52 MB
The Hindi Literature & News Article Blog Corpus is a comprehensive text dataset curated to support natural language processing research, language modeling, and computational linguistics tasks in Hindi. The corpus brings together two primary content domains, literary writing and news article blogs, reflecting the richness and diversity of contemporary and classical Hindi prose. Literary content includes narrative writing, cultural commentary, and reflective articles drawn from established Hindi authors and publications, while the news article blog section captures journalistic writing, opinion pieces, and current affairs coverage in everyday Hindi. All texts are written in the Devanagari script, ensuring compatibility with modern NLP pipelines and Unicode-compliant tools. This dataset is particularly suited for tasks such as text classification, language modeling, authorship attribution, sentiment analysis, and style transfer, and serves as a valuable resource for researchers and developers working on Hindi language AI systems.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
This dataset is intended solely for research and non-commercial purposes; any commercial use, redistribution, or reproduction of the content without prior authorization is strictly prohibited.
Forbidden Usage
It is forbidden to use this dataset for commercial purposes, redistribution, or any other unauthorized use beyond academic and non-commercial research.
Intended Use
This dataset is intended for use in Hindi natural language processing research, including language modeling, text classification, sentiment analysis, and authorship attribution tasks.
Hindi (हिन्दी) is an Indo-Aryan language of the Indo-European language family and the most widely spoken language in India. It is an official language of the Government of India, alongside English, and holds official status across numerous Indian states including Uttar Pradesh, Bihar, Madhya Pradesh, Rajasthan, and Uttarakhand. Hindi has hundreds of millions of speakers across South Asia and among large diaspora communities in the United States, United Kingdom, Mauritius, Fiji, and the Gulf region. According to Glottolog, it belongs to the Central Indo-Aryan group. Hindi shares a high degree of mutual intelligibility with Urdu and is written in the Devanagari script. Many speakers are also bilingual in English or a regional language, depending on their state and level of education.
Devanagari Script
अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्र, ज्ञ, ं, ः, ँ
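As a rough sanity check before feeding files into a pipeline, the share of Devanagari characters in a text can be measured. The sketch below assumes the texts use the main Devanagari Unicode block (U+0900 to U+097F); the function name `devanagari_ratio` is hypothetical and not part of the dataset.

```python
import unicodedata

# Main Devanagari Unicode block (assumption: extended blocks are not needed here).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Return the share of letters and combining marks in `text` that are Devanagari.

    Digits, punctuation, and whitespace are ignored so that numbers and
    Latin punctuation inside Hindi prose do not lower the score.
    """
    # Unicode general categories starting with "L" are letters, "M" are marks
    # (vowel signs, virama, anusvara, and so on).
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in relevant)
    return hits / len(relevant)
```

A file with a low ratio may contain mostly Latin-script text or encoding damage and is worth inspecting by hand; what counts as "low" depends on the task and is left to the user.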
Literature: Creative and narrative writing including short stories, prose, reflective essays, and cultural commentary by established Hindi authors.
Article Blog: Informal blog-style articles covering current affairs, social commentary, humour, and everyday observations written for a broad Hindi-speaking audience.
The dataset is organized by author. Each author folder contains one or more domain-specific sub-collections:
Hindi Literature & News Article Blog Corpus/
│
├── Anu Sakti Singh/
│   └── Literature/
│       ├── 01-Hindi Literature Collection.txt
│       └── ...
│
├── Anup Shukla/
│   └── Article Blog/
│       ├── 01-Hindi Article Blog Collection.txt
│       └── ...
│
├── Arun Kumar Sharma/
│   └── Literature/
│       ├── 01-Hindi Literature Collection.txt
│       └── ...
│
├── Bhumika Dwivedi Ask/
│   └── Literature/
│       ├── 01-Hindi Literature Collection.txt
│       └── ...
│
├── Pratap Sehgal/
│   └── Literature/
│       ├── 01-Hindi Literature Collection.txt
│       └── ...
│
├── Purnima Burman/
│   └── Literature/
│       ├── 01-Hindi Literature Collection.txt
│       └── ...
│
└── Vijay Pandit/
    └── Literature/
        ├── 01-Hindi Literature Collection.txt
        └── ...
Anu Sakti Singh — Literature
Anup Shukla — Article Blog
Arun Kumar Sharma — Literature
Bhumika Dwivedi Ask — Literature
Pratap Sehgal — Literature
Purnima Burman — Literature
Vijay Pandit — Literature
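Because the author and domain are encoded in the folder names, labels for tasks such as authorship attribution or domain classification can be read straight from each file's path. The following is a minimal sketch, assuming the files are UTF-8 encoded and that the corpus follows the `<author>/<domain>/<file>.txt` layout shown above; `iter_corpus` is a hypothetical helper name.

```python
from pathlib import Path


def iter_corpus(root):
    """Yield (author, domain, path, text) for each .txt file under root.

    Assumes the layout <root>/<author>/<domain>/<file>.txt and UTF-8 encoding;
    neither the encoding nor the helper name comes from the dataset itself.
    """
    root = Path(root)
    # The pattern matches exactly two directory levels below the root.
    for path in sorted(root.glob("*/*/*.txt")):
        author = path.parts[-3]  # e.g. "Anup Shukla"
        domain = path.parts[-2]  # e.g. "Article Blog" or "Literature"
        yield author, domain, path, path.read_text(encoding="utf-8")
```

Calling `iter_corpus("Hindi Literature & News Article Blog Corpus")` from the directory that contains the unpacked corpus would then produce one labeled record per file.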
| Field | Details |
|---|---|
| Dataset Name | Hindi Literature & News Article Blog Corpus |
| Language | Hindi (हिन्दी) |
| Language Family | Indo-European — Indo-Aryan Branch |
| Number of Authors | 7 |
| Number of Domains | 2 (Literature, Article Blog) |
| File Format | Plain Text (.txt) |
Format: Plain Text (.txt)
इस बार की चिटठा चर्चा कुश की कलम से.. नमस्कार, मैं कुश आपका प्रिय न्यूज़ रीडर स्वागत करता हूँ आपका चिट्ठा चर्चा में।
पिछले दिनों हमने आपको बताया था की किस प्रकार एक ब्लॉगर ने टिप्पणियों के अभाव में दम तोड़ा।
हमारे जयपुर के संवाददाता ने बताया की ब्लॉग जगत का एक लोक प्रिय किरदार 'सुदामा' अब अभिषेक जी के कार्टून में भी देखा गया।
टिप्पणियों का सूचकांक कल शाम 11217.50 तक रहा। समीर जी के शहर में ना होने की वजह से इसमें 3 फीसदी गिरावट आई है।
चलते चलते अहमद फ़राज़ साहब को चिट्ठा चर्चा की और से हार्दिक श्रद्धांजलि।
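The sample above ends most sentences with the danda (।) rather than a full stop, so splitters tuned for English punctuation may miss sentence boundaries. A minimal sketch of a danda-aware splitter follows; it is an illustration under the assumption that the danda, question mark, and exclamation mark are the only terminators, and `split_sentences` is a hypothetical name.

```python
import re

# Split on whitespace that follows a danda, question mark, or exclamation mark.
# The ASCII full stop is deliberately left out so that decimals such as
# 11217.50 are never split.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Split Hindi text into sentences, keeping the terminator on each sentence."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]
```

Text that ends sentences with an ASCII full stop or with ".." would need extra rules, which this sketch does not attempt.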