Task: NLP
Release Date: 6/2/2026
Format: TXT
Size: 4.20 MB
This dataset is a curated Marathi text corpus developed to support research and development in Natural Language Processing (NLP) and language technology for the Marathi language. The corpus contains diverse textual content collected from publicly available written sources, providing rich linguistic variety in vocabulary, sentence structure, and writing styles. The dataset is intended for tasks such as language modeling, text classification, machine translation, sentiment analysis, summarization, information retrieval, and other NLP applications involving Marathi. It can also be used for linguistic analysis, educational purposes, and the development of low-resource language technologies. Special care has been taken to organize and clean the text data to improve usability for researchers, students, and developers. The dataset aims to contribute to the growth of open Marathi language resources and encourage further research in Indic language processing.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is intended for research, educational, and non-commercial natural language processing applications. Users must ensure that the dataset is not used for unlawful, harmful, discriminatory, or misleading activities.
Forbidden Usage
Any attempt to use this dataset for harmful, misleading, or discriminatory content generation is prohibited.
Intended Use
This dataset is intended for research, education, and the development of Marathi natural language processing applications.
Marathi (मराठी) is an Indo-Aryan language of the Indo-European language family. Glottolog places it in the Southern Indo-Aryan group alongside Konkani and Sinhalese. It is the official language of the Indian state of Maharashtra and is widely spoken in Goa, Karnataka, Madhya Pradesh, and among diaspora communities in the United States, the United Kingdom, and the Gulf region. Marathi has a rich literary tradition dating back over a thousand years, with significant contributions in poetry, prose, philosophy, and journalism. Most speakers are bilingual in Hindi or English, depending on their region and level of education.
Devanagari Script
अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ, अं, अः, क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, ळ, क्ष, ज्ञ, ं, ः, ँ
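When filtering or validating text for a Devanagari corpus, one common check is the share of letters that fall in the Unicode Devanagari block (U+0900 to U+097F). The following is a minimal sketch of such a check; the function name `devanagari_ratio` is hypothetical, and nothing here describes how this dataset was actually cleaned.

```python
import unicodedata

# Unicode Devanagari block boundaries (inclusive).
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def devanagari_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks in the Devanagari block.

    Digits, punctuation, and whitespace are ignored, so a sentence that mixes
    Marathi with Latin-script words scores below 1.0.
    """
    # Keep only characters whose general category is Letter (L*) or Mark (M*).
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    in_block = sum(
        1 for ch in relevant if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END
    )
    return in_block / len(relevant)
```

A threshold on this ratio (the exact value is a judgment call) can flag lines that are mostly Latin-script or otherwise off-language before training.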
Blog Articles: Informal blog-style writing covering current affairs, social commentary, political opinion, and everyday observations for a broad Marathi-speaking audience.
Literature: Creative and narrative writing including prose, short stories, reflective essays, and cultural commentary by established Marathi authors.
Literature & News Articles: A blend of literary writing and journalistic content covering regional news, cultural events, and social issues.
Editorial Opinion Article: Opinion-driven editorial writing addressing political, social, and civic issues from a critical and analytical perspective.
The dataset is organized by author. Each author folder contains one or more domain-specific sub-collections:
Marathi Text Corpus/
│
├── Gurudatta Sohono/
│   └── Blog Articles/
│       ├── 01-Marathi Blog Article Collection.txt
│       └── ...
│
├── Mitraho/
│   ├── Blog Articles/
│   │   ├── 01-Marathi Blog Article Collection.txt
│   │   └── ...
│   └── Blog Articles 1/
│       ├── 01-Marathi Blog Article Collection 1.txt
│       └── ...
│
├── Mohana Joglekar/
│   └── Literature and News Articles/
│       ├── 01-Marathi Literature and News Article Collection.txt
│       └── ...
│
├── Mynac/
│   └── Blog Article/
│       ├── 01-Marathi Blog Article Collection.txt
│       └── ...
│
├── Prakash Ghatpande/
│   ├── Blog Article 1/
│   │   └── ...
│   ├── Blog Article 2/
│   │   └── ...
│   ├── Blog Article 3/
│   │   └── ...
│   ├── Blog Article 4/
│   │   └── ...
│   ├── Blog Article 5/
│   │   └── ...
│   └── Literature/
│       ├── 01-Marathi Literature Collection.txt
│       └── ...
│
└── Vinit Wakhede/
    └── Editorial Opinion Article/
        ├── 01-Marathi Editorial Opinion Article Collection.txt
        └── ...
Gurudatta Sohono — Blog Articles
Mitraho — Blog Articles, Blog Articles 1
Mohana Joglekar — Literature and News Articles
Mynac — Blog Article
Prakash Ghatpande — Blog Article 1, 2, 3, 4, 5, Literature
Vinit Wakhede — Editorial Opinion Article
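The author and domain layout above can be walked with a short script. This is a sketch only: the helper names `index_corpus` and `read_collection` are hypothetical, the root path is whatever folder the archive was extracted to, and UTF-8 encoding is an assumption (the card states only that the text is Unicode Devanagari).

```python
from collections import defaultdict
from pathlib import Path


def index_corpus(root):
    """Map (author, domain) to the sorted list of .txt files in that folder.

    Assumes the two-level layout shown above: root/<author>/<domain>/<file>.txt
    """
    index = defaultdict(list)
    for path in sorted(Path(root).glob("*/*/*.txt")):
        author = path.parent.parent.name
        domain = path.parent.name
        index[(author, domain)].append(path)
    return dict(index)


def read_collection(path):
    """Read one collection file as text (UTF-8 is an assumption)."""
    return Path(path).read_text(encoding="utf-8")
```

For example, `index_corpus("Marathi Text Corpus")` would return one key per author and domain pair, which makes it straightforward to build per-author or per-domain splits.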
| Field | Details |
|---|---|
| Dataset Name | Marathi Text Corpus |
| Language | Marathi (मराठी) |
| Language Family | Indo-European — Indo-Aryan Branch |
| Script | Devanagari Script (Unicode) |
| Number of Authors | 6 |
| Number of Domains | 4 (Blog Articles, Literature, Literature & News Articles, Editorial Opinion Article) |
| File Format | Plain Text (.txt) |
Format: Plain Text (.txt)
Naming Convention: `Marathi [Domain] Collection.txt`
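File names in the directory listing also carry a numeric prefix (for example `01-`) and, in one folder, a trailing number after `Collection`. A hedged sketch of parsing such names follows; the regular expression is inferred from the listed examples only, and the names `CollectionName` and `parse_collection_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CollectionName(NamedTuple):
    sequence: Optional[str]  # leading number such as "01", if present
    domain: str              # text between "Marathi" and "Collection"
    suffix: Optional[str]    # trailing number after "Collection", if present


# Pattern inferred from the example file names; real files may deviate from it.
_PATTERN = re.compile(r"^(?:(\d+)-)?Marathi (.+?) Collection(?: (\d+))?\.txt$")


def parse_collection_name(filename: str) -> Optional[CollectionName]:
    """Split a collection file name into its parts, or return None if it does not match."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return CollectionName(*match.groups())
```

Returning `None` for non-matching names, rather than raising, lets a loader skip unexpected files without stopping.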
चारपाच दिवसांपूर्वी महाराष्ट्रातील तमाम मिडिया एका कळीने अख्खा दिवस कासावीस झाला होता. एकच क्लिप वारंवार घासून दाखवली जात होती.
अख्खा महाराष्ट्र आज दुष्काळाच्या टकमक टोकावर उभा आहे, विदर्भातीलच नव्हे तर मराठवाडा-कोकण इतकेच काय कृषीसधन पश्चिम महाराष्ट्रातील शेतकर्यावर देखील या वर्षी आत्महत्या करण्याची पाळी येण्यासारखी स्थिती आहे.
टी.आर.पी. मंगता है भाय! या सर्व बाबतीत आमचा मिडिया चिडीचुप आहे. कारण त्यात टी.आर.पी. नाही.
राज-उद्धव कहाणीत मात्र टी.आर.पी. ला लागणारा सर्व मालमसाला आहे. अॅक्शन, इमोशन, मेलोड्रामा सर्वकाही.
पिक्चर अभी बाकी है.. आगामी विधानसभा निवडणुकीच्या अगदी तोंडावर एक होण्याचा निर्णय घेता येणार नाही अशा परिस्थितीत तिकीट वाटपाचे अनेक पेच निर्माण होतील.