Task: NLP
Release Date: 12/8/2025
Format: TXT
Size: 14.70 MB
The Indus Kohistani corpus contains around 500k tokens of folktales, stories, poetry, biographies, and conversational texts, all transcribed with a consistent community orthography. Reviewed by native speakers, the corpus offers a representative snapshot of the language’s vocabulary and grammar for linguistic and computational research.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
This corpus may not be used by any organization with an annual revenue exceeding 1 million USD.
Forbidden Usage
The corpus is strictly forbidden for use in creating synthetic text, generating hateful or harmful content, or developing tools that enable such outputs.
Intended Use
This corpus is intended to support linguistic research, documentation, and community-centered technology development for Indus Kohistani. It is designed to help create educational resources, improve language understanding, and advance inclusive AI tools that benefit the speaker community.
Indus Kohistani (mvy) is an Indo-Aryan language spoken in the upper Indus Valley of northern Pakistan, primarily in Kohistan district. It is used across several villages and valleys, showing noticeable variation in pronunciation and vocabulary between communities. The language is rich in oral traditions, with folktales, poetry, and storytelling serving as important cultural practices. Although widely spoken, it remains under-documented and has limited written materials, making it an important language for linguistic research and resource development.
Folktales and traditional narratives
Oral histories and storytelling
Poetry and songs
Children's stories
Biographies and life narratives
Conversational dialogues
Descriptive and explanatory texts
Proverbs and short sayings
Religious literature
The processing will combine all plain text, PDF text, digits, and reference images into a clean, organized dataset. Text from PDFs will be extracted directly and standardized for Unicode, spacing, and orthography, while digits and symbols will be cleaned and formatted consistently. The images, which contain no text, will be stored as reference materials in a structured folder. The final output will include uniformly formatted UTF-8 text files and neatly organized reference images.
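The Unicode and spacing standardization described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the card does not name a normalization form or a toolchain, so NFC is assumed here, and the function name `clean_text` is hypothetical. Orthography-specific rules for Indus Kohistani are not covered.

```python
import re
import unicodedata


def clean_text(raw: str, form: str = "NFC") -> str:
    """Normalize Unicode and tidy whitespace in one extracted text.

    Assumption: NFC is the target normalization form; the dataset card
    does not say which form the maintainers use.
    """
    # Compose base letters and combining marks into canonical form.
    text = unicodedata.normalize(form, raw)
    # Drop stray byte-order marks and turn non-breaking spaces into plain spaces.
    text = text.replace("\ufeff", "").replace("\u00a0", " ")
    # Use a single newline convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "a \t b\r\n\r\n\r\n\r\nc\u00a0d "
    # The cleaned string can be written out as UTF-8 bytes.
    print(clean_text(sample).encode("utf-8"))
```

Zero-width joiners and non-joiners are deliberately left untouched in this sketch, since they can be meaningful in Arabic-script text.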
َ ُ ِ ّ ا ب پ ت ٹ ث چ ڇ څ ح خ د ڈ ذ ر ڑ ز ژ ڙ س ش ݜ ص ض ط ظ ع غ ف ق ک گ ل م ن ݨ و ہ ی
او ڙھا تیں بال لر نی تھی، نہ مہ زُنازُو تیں مُخالیۡفت کرم تُھو چے سِوَیں ژؤن٘دُناں مُختلف حالتیُوں مہ ادا کرَیں لاقَت لہ تَنہی نی ہُوئ تھی۔ دویُوں مِثال ݜے تھُو چے کماݜ لازمی کمہۡ (واجب) مُوڙ (چُن٘ڑ بول)، گُو (تھُل بول) یا