Task: NLP
Release Date: 1/5/2026
Format: TXT
Size: 1.88 MB
This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Balochi is an Iranian language (Indo-European family) spoken primarily across Balochistan (Pakistan and Iran) and parts of Afghanistan, with large diaspora communities in the Gulf and elsewhere. It is commonly written in a Perso-Arabic script and is widely used in oral traditions such as poetry, proverbs, and riddles, alongside modern writing in novels and journalism. Balochi has several major regional varieties, often grouped as Western, Eastern, and Southern, that differ in pronunciation, vocabulary, and some grammatical patterns. In and around Quetta, the variety most commonly associated with everyday use is Western Balochi (bgn).
Literature (Creative writing)
Poetry (Aesthetic / cultural expression)
Journalism & General Writing
Folklore & Oral Tradition (Textual form)
Everyday Social Themes (as reflected in texts)
Cultural Knowledge & Heritage
Language Variation & Style
آ ا ب پ ت ٹ ج چ د ڈ ر ڑ ز ژ س ش ک گ ل ن م ۆ و ه ئ ی ێ ے َ ِ ُ ْ ص ض ط ظ خ ث ع غ ذ ف ق
The dataset has 14 files.
Each file name matches the content inside (e.g., novel.txt, sentences.txt, proverbs.txt, riddles.txt).
Treat each file as a separate genre/domain container.
Raw: original 14 files (unchanged)
Clean: same 14 files after normalization (same filenames)
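The Raw/Clean layout suggests a simple consistency check: both sets should hold the same filenames. The following is a minimal sketch only; the directory names `raw` and `clean` and both helper functions are hypothetical and not part of the dataset description.

```python
from pathlib import Path

EXPECTED_FILE_COUNT = 14  # file count stated in the dataset description


def list_txt_names(directory: str) -> set[str]:
    """Return the names of the .txt files directly inside `directory`."""
    return {p.name for p in Path(directory).glob("*.txt")}


def compare_layouts(raw_names: set[str], clean_names: set[str]) -> dict[str, list[str]]:
    """Report file names that appear in only one of the two sets."""
    return {
        "missing_from_clean": sorted(raw_names - clean_names),
        "missing_from_raw": sorted(clean_names - raw_names),
    }
```

A possible use is `compare_layouts(list_txt_names("raw"), list_txt_names("clean"))`, followed by comparing the size of each set against `EXPECTED_FILE_COUNT`.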
Metadata fields to include for each file:
file_id, file_name, content_type, language_iso639_3 (bgn), variety (Quetta), script, word_count, cleaning_level, rights_status, license, notes
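The field list above could be represented as one record per file. This is a sketch under stated assumptions: the `FileRecord` class and `build_record` helper are hypothetical, `content_type` is guessed from the file name stem, and `word_count` uses plain whitespace splitting, which may not match how the corpus authors count words.

```python
from dataclasses import dataclass, asdict


@dataclass
class FileRecord:
    file_id: str
    file_name: str
    content_type: str
    language_iso639_3: str
    variety: str
    script: str
    word_count: int
    cleaning_level: str
    rights_status: str
    license: str
    notes: str = ""


def build_record(file_id: str, file_name: str, text: str,
                 cleaning_level: str, rights_status: str,
                 notes: str = "") -> FileRecord:
    """Build one metadata record; constant values come from the dataset description."""
    return FileRecord(
        file_id=file_id,
        file_name=file_name,
        # Assumption: the file stem names the genre (e.g. proverbs.txt -> proverbs).
        content_type=file_name.rsplit(".", 1)[0],
        language_iso639_3="bgn",
        variety="Quetta",
        script="Perso-Arabic",
        # Assumption: words are whitespace-separated tokens.
        word_count=len(text.split()),
        cleaning_level=cleaning_level,
        rights_status=rights_status,
        license="CC-BY-NC-SA-4.0",
        notes=notes,
    )
```

Calling `asdict(build_record(...))` yields a plain dictionary that can be written out as one row of a CSV or JSON metadata file.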
UTF-8 encoding, Unicode normalization, whitespace and punctuation cleanup
Remove stray symbols/markup if needed
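The cleaning steps above might look like the following. It is a rough sketch, not the procedure actually used for this corpus: the choice of NFC, the tag-stripping pattern, and the `clean_text` name are all assumptions. The zero-width non-joiner (U+200C) is deliberately left alone because it can be meaningful in Perso-Arabic text.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply light normalization to one file's text (assumed steps)."""
    # Assumption: NFC is the target normalization form.
    text = unicodedata.normalize("NFC", raw)
    # Drop byte-order marks left over from encoding conversion.
    text = text.replace("\ufeff", "")
    # Assumption: stray markup looks like simple HTML/XML tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim spaces around line breaks, then cap blank lines at one.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Reading with `open(path, encoding="utf-8")` and writing the result back under the same file name in the Clean set would keep the Raw files unchanged.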
بلوچی عہدی شاعری ءِ تہا ڈرامہ ءِ کُلّیں سپت موجود انت۔ اگاں ما مروچی حانی ءُ شے مرید ءِ شعری داستان ءَ واناں یا اشکناں تہ اے پیمیں شاعری مارا ہما عہد ءُ دئور ءُ باری ءِ گوازینگ ءِ راہبندانی چپّ ءُ چاگرد ءَ پیش داریت۔ سیوی ءِ جلگہیں شہر انت، میر چاکر ءِ ماڑی ءَ رندانی کچہری ءُ دیوان انت، مُچّی ءُ مراگاہ انت،بندات ءَ شاعر میر چاکر ءِ کردار ءَ دیما کاریت کہ آ دیوان ءِ دیما یک بندے بندیت کہ دیوان ءِ نندوک ہما بند ءِ بوجگ ءُ تچک کنگ ءَ حیران انت: