Hindi 10 Million Text Corpus

Description

The Hindi 10 Million Text Corpus is a curated collection of Hindi text, around 10 million tokens in total, drawn from multiple authors. It includes both literary texts and personal articles, which makes it suitable for research in Hindi NLP, corpus linguistics, stylistic analysis, and digital humanities.

Specifics

Licensing

Creative Commons Attribution No Derivatives 4.0 International (CC-BY-ND-4.0)

https://spdx.org/licenses/CC-BY-ND-4.0.html

Considerations

Restrictions/Special Constraints

Proper attribution is required. Redistribution of modified or adapted versions of the dataset is not permitted. Use must comply with the CC-BY-ND-4.0 license and applicable copyright laws.

Forbidden Usage

Users may not distribute modified, remixed, transformed, or adapted versions of this dataset. The dataset may not be used in any way that violates copyright, attribution requirements, or other terms of the CC-BY-ND-4.0 license.

Language

Hindi is an Indo-Aryan language primarily spoken in India and written mainly in the Devanagari script. It is one of the most widely spoken languages in the world and is used across literature, media, education, administration, and everyday communication.

Data Structure

- The corpus is organized into author-level folders
- Each folder is named after its author
- Each folder contains that author's articles or literary texts
- This layout supports authorship analysis, stylistic comparison, and author-wise NLP research
- It also makes author-based data splitting and corpus analysis easier
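
The folder layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the `.txt` extension, the UTF-8 encoding, and the names `Document`, `iter_documents`, and `split_by_author` are assumptions made for illustration and are not specified by the dataset.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, NamedTuple, Set, Tuple


class Document(NamedTuple):
    author: str  # folder name, assumed to be the author's name
    path: Path   # location of the text file
    text: str    # raw file contents


def iter_documents(corpus_root: Path, pattern: str = "*.txt") -> Iterator[Document]:
    """Yield one Document per file found under each author folder.

    The "*.txt" pattern is an assumption; adjust it to the actual file type.
    """
    author_dirs = sorted(p for p in corpus_root.iterdir() if p.is_dir())
    for author_dir in author_dirs:
        for file_path in sorted(author_dir.rglob(pattern)):
            if not file_path.is_file():
                continue
            # errors="replace" keeps the loop running on badly encoded files.
            text = file_path.read_text(encoding="utf-8", errors="replace")
            yield Document(author_dir.name, file_path, text)


def split_by_author(
    docs: Iterable[Document], held_out_authors: Set[str]
) -> Tuple[List[Document], List[Document]]:
    """Partition documents so that held-out authors never appear in training."""
    train: List[Document] = []
    test: List[Document] = []
    for doc in docs:
        (test if doc.author in held_out_authors else train).append(doc)
    return train, test
```

Splitting by author rather than by file keeps an author's style from leaking between the training and test sets, which matters for authorship and stylistic experiments. A call might look like `split_by_author(iter_documents(Path("hindi_corpus")), {"Vijay Pandit"})`, where `hindi_corpus` is a hypothetical local path and the folder name is assumed to match the author's name as listed in the attribution section.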

Recommended Processing

- Normalize text encoding to UTF-8
- Standardize Unicode and punctuation forms
- Remove formatting noise, extra spaces, and non-text artifacts
- Preserve author-wise folder structure during preprocessing
- Segment texts into documents, paragraphs, or sentences as needed
- Check for duplicates and near-duplicate files
- Apply tokenization and script-aware normalization for Hindi NLP tasks
- Keep metadata linking each text to its author for downstream analysis
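
Several of the steps above can be sketched with the Python standard library alone. The code below is illustrative rather than a prescribed pipeline: the choice of NFC, the set of invisible characters removed, the danda-based sentence splitter, and the 5-character shingle size are all assumptions, and every function name is hypothetical.

```python
import hashlib
import re
import unicodedata
from typing import Dict, List

# Byte-order mark and zero-width space. ZWJ/ZWNJ are left in place on purpose,
# since they can change how Devanagari conjuncts are rendered.
_INVISIBLE = dict.fromkeys(map(ord, "\ufeff\u200b"))
_SPACES = re.compile(r"[^\S\n]+")              # any whitespace except newline
_BLANK_LINES = re.compile(r"\n{3,}")           # runs of empty lines
_SENTENCE_END = re.compile(r"(?<=[।॥?!])\s+")  # break after danda, ?, or !


def normalize_text(text: str) -> str:
    """Apply Unicode NFC and collapse whitespace noise."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_INVISIBLE)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _SPACES.sub(" ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = _BLANK_LINES.sub("\n\n", text)
    return text.strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter; it does not handle quotations or abbreviations."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def find_exact_duplicates(texts: Dict[str, str]) -> List[List[str]]:
    """Group document ids whose normalized text is byte-identical."""
    groups: Dict[str, List[str]] = {}
    for doc_id, text in texts.items():
        digest = hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def shingle_jaccard(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity over character k-grams, a simple near-duplicate signal."""
    shingles_a = {a[i:i + k] for i in range(max(len(a) - k + 1, 1))}
    shingles_b = {b[i:i + k] for i in range(max(len(b) - k + 1, 1))}
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)
```

Keying `find_exact_duplicates` on an identifier that includes the author folder (for example `"author/filename"`) keeps the link between each text and its author intact. Pairwise `shingle_jaccard` comparison is quadratic in the number of files, so for the full corpus a scalable method such as MinHash may be a better fit; the similarity threshold for calling two files near-duplicates has to be chosen empirically.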

Attribution Requirement

Attribution to the original authors is mandatory:
- Anup Shukla
- Arun Asthana
- Anu Shakti Singh
- Bhumika Dwivedi
- Pratap Sehgal
- Purnima Varman
- Vijay Pandit

Sample

टरनेट के माध्यम से कला-साहित्य-संस्कृति की दुनिया से जुड़े पाठकों रचनाकारों के लिये पूर्णिमावर्मन जाना-पहचाना नाम है। देश दुनिया के कोने-कोने में अभिव्यक्ति एवं अनुभूति के माध्यम से साहित्य-संस्कृति के प्रचार-प्रसार में जुटी पूर्णिमाजी बताती हैं :- हिन्दी में शायद यह पहली पत्रिका होगी जहां संपादक एक देश में निदेशक दूसरे देश में और टाइपिस्ट तीसरे देश में हों। फिर भी सब एक दूसरे को देख सकते हों सुन सकते हों दिन में चार घंटे दो घंटे सुबह और दो घंटे शाम। वो भी तब जब एक की दुनिया में दिन हो और दूसरे की दुनिया में रात। हम आपस में अक्सर कहते हैं, “हम दिन रात काम करते हैं। इसी लिये तो हम दूसरों से बेहतर काम करते हैं”। पीलीभीत की सुंदर घाटियों में जन्मी पूर्णिमाजी को प्रकृतिप्रेम एवं कला के प्रति बचपन से अनुराग रहा। फिर मिर्जापुर व इलाहाबाद में इस अनुराग में साहित्य एवं संस्कृति के रंग भी मिले। संस्कृत साहित्य में स्नातकोत्तर उपाधि, पत्रकारिता तथा वेब डिजाइनिंग में डिप्लोमा पूर्णिमाजी के जीवन का पहला लगाव पत्रकारिता आजतक बना हुआ है। इलाहाबाद के दिनों में अमृतप्रभात, आकाशवाणी के अनुभव आज भी ऊर्जा देते हैं। जलरंग, रंगमंच, संगीत और स्वाध्याय से दोस्ती रखने वाली पूर्णिमाजी पिछले पचीस सालों से संपादन, फ्रीलांसर, अध्यापन, कलाकार, ग्राफिक डिजाइनिंग तथा जाल प्रकाशन के रास्तों से गुजरती हुई फिलहाल अभिव्यक्ति तथा अनुभूति के प्रकाशन तथा कलाकर्म में व्यस्त हैं।