License: CC-BY-NC-4.0
Steward: MirasAI
Task: NLP
Release Date: 4/21/2026
Format: TXT, DOCX
Size: 7.01 MB
The Rohingya Literature Corpus is a rare linguistic resource comprising approximately 613,500 tokens. It is distinguished by its use of the Myanmar (Burmese) script to represent the Rohingya language, a combination that is very hard to find in available linguistic resources. The corpus consists mainly of articles covering cultural knowledge, folklore, oral traditions, and social themes. To support computational processing and large-scale analysis, the collection is divided into 7 files, each representing specific thematic sub-genres and chronological periods. The dataset provides a foundation for researchers studying cross-script linguistic phenomena and developing computational tools for the Rohingya community.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes.
Forbidden Usage
Users agree not to attempt to determine the identity of individuals mentioned in the text and are strictly prohibited from using this data for commercial purposes or to train deceptive AI systems.
Ethical Review
Every file in this corpus was acquired and compiled after explicit informed consent was obtained from the original authors via the Haque Publishing Agency.
Intended Use
This dataset is intended for Natural Language Processing (NLP) research and the development of linguistic tools for the Rohingya language.
Rohingya (Ruáingga) is an Eastern Indo-Aryan language of the Indo-Iranian branch of the Indo-European family. While it is often written in the Hanifi, Latin (Rohingyalish), or Arabic scripts, this corpus uses the Myanmar (Burmese) script. The script is an abugida written from left to right; its rounded letter shapes developed historically to avoid tearing the palm leaves used as a writing surface. Because Rohingya is a tonal language, the text uses specific diacritics to represent its tones.
33 Consonants: က, ခ, ဂ, ဃ, င, စ, ဆ, ဇ, ဈ, ည, ဋ, ဌ, ဍ, ဎ, ဏ, တ, ထ, ဒ, ဓ, န, ပ, ဖ, ဗ, ဘ, မ, ယ, ရ, လ, ဝ, သ, ဟ, ဠ, အ
Independent Vowels: ဣ, ဤ, ဥ, ဦ, ဧ, ဩ, ဪ
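Since the corpus is described as written in the Myanmar script, one simple sanity check is to flag letters and combining marks that fall outside the Unicode Myanmar block. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "Myanmar script" means the main block U+1000 to U+109F (the extended blocks are ignored).

```python
import unicodedata

# Assumption: only the main Unicode Myanmar block (U+1000..U+109F) is checked.
MYANMAR_START, MYANMAR_END = 0x1000, 0x109F


def is_myanmar(ch: str) -> bool:
    """Return True if a single character lies in the main Myanmar block."""
    return MYANMAR_START <= ord(ch) <= MYANMAR_END


def foreign_letters(text: str) -> list[str]:
    """Return letters and combining marks that are outside the Myanmar block.

    Punctuation, digits, and whitespace are skipped because only Unicode
    categories starting with 'L' (letter) or 'M' (mark) are inspected.
    """
    return [
        ch
        for ch in text
        if unicodedata.category(ch)[0] in ("L", "M") and not is_myanmar(ch)
    ]
```

A check like this can catch look-alike characters from other scripts, such as Thai น (U+0E19) appearing in place of Myanmar န (U+1014). Hits are not necessarily errors, since the articles also contain legitimate Latin-script names.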
Literature (Articles): A multifaceted domain encompassing cultural folklore, social dynamics, and professional journalistic prose.
Poetry: Aesthetic and cultural expression through tonal verse.
Folklore & Oral Tradition: Written records of traditional stories and community heritage.
Everyday Social Themes: Contextual reflections of social life and contemporary narratives.
Cultural Knowledge & Heritage.
The dataset is organized into two primary directories to maintain data integrity and accessibility:
01-TXT Files (UTF-8-Converted): Standardized, machine-readable plain text files.
02-Original Files (DOCX): Original source documents in Microsoft Word format.
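As a minimal sketch of how the plain-text directory could be read, the helper below loads every .txt file as UTF-8. The directory name is taken from the description above; the function name and the location of the dataset root are assumptions.

```python
from pathlib import Path

# Directory name as given in the dataset description.
TXT_DIR_NAME = "01-TXT Files (UTF-8-Converted)"


def load_corpus(root: str) -> dict[str, str]:
    """Read every .txt file in the TXT directory as UTF-8, keyed by file name.

    `root` is the folder where the dataset was unpacked (an assumption;
    adjust it to the actual download location).
    """
    txt_dir = Path(root) / TXT_DIR_NAME
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(txt_dir.glob("*.txt"))
    }
```

Passing the encoding explicitly avoids depending on the platform default, which matters for Myanmar-script text on systems whose default encoding is not UTF-8.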
To optimize processing efficiency, the corpus is distributed across 7 files, each serving as a domain or chronological container.
Haque Publishing Agency, Rajshahi, Bangladesh.
01-Rohingya Language Articles Collection - 98500-T.txt/.docx
02-Rohingya Language Articles Collection - 96900-T.txt/.docx
03-Rohingya Language Articles Collection - 86100-T.txt/.docx
04-Rohingya Language Articles Collection - 96000-T.txt/.docx
05-Rohingya Language Articles Collection - 95500-T.txt/.docx
06-Rohingya Language Articles Collection - 94000-T.txt/.docx
07-Rohingya Language Articles Collection - 46500-T.txt/.docx
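The number before "-T" in each file name appears to be that file's approximate token count. Under that assumption, the short check below extracts the numbers and confirms that they add up to the stated corpus size of about 613,500 tokens. The helper name is hypothetical.

```python
import re

# File stems copied from the list above (extensions omitted).
FILE_STEMS = [
    "01-Rohingya Language Articles Collection - 98500-T",
    "02-Rohingya Language Articles Collection - 96900-T",
    "03-Rohingya Language Articles Collection - 86100-T",
    "04-Rohingya Language Articles Collection - 96000-T",
    "05-Rohingya Language Articles Collection - 95500-T",
    "06-Rohingya Language Articles Collection - 94000-T",
    "07-Rohingya Language Articles Collection - 46500-T",
]


def token_count(stem: str) -> int:
    """Extract the number before the trailing '-T' (assumed token count)."""
    match = re.search(r"(\d+)-T$", stem)
    if match is None:
        raise ValueError(f"no token count found in {stem!r}")
    return int(match.group(1))


total = sum(token_count(stem) for stem in FILE_STEMS)
print(total)  # 613500
```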
Detailed Conversion: Every file was meticulously converted from .docx to UTF-8 encoded text using a high-precision methodology.
Unicode Normalization: Standardized to ensure consistent rendering of tones and rare script markers.
Refined Cleanup: Automated and manual removal of stray symbols, markup, and formatting artifacts.
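The description does not say which Unicode normalization form was applied. As an illustration only, the sketch below uses NFC from Python's standard library; the function names and the choice of NFC are assumptions, not the maintainers' documented pipeline.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Return `text` in the given Unicode normalization form.

    NFC is an assumed default; the dataset description does not name a form.
    """
    return unicodedata.normalize(form, text)


def needs_normalization(text: str, form: str = "NFC") -> bool:
    """Return True if `text` is not already in the given form (Python 3.8+)."""
    return not unicodedata.is_normalized(form, text)
```

Normalization matters for this script because stacked marks can be typed in more than one order. For example, canonical ordering places the dot below (U+1037) before the asat (U+103A), so two visually identical strings compare equal only after normalization.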
ဟိုတစ်နေ့က နိုင်ငံခြားသား မွစ်လင်မ်သူငယ်ချင်းတစ်ယောက်က အပြာဆန်တဲ့ ဗီဒီယိုရုပ်သံဖိုင် တစ်ခုပြတယ်။သူ မပြခင်ကတည်းကပြောထားတယ်။ဘိုင်…စကားအသံကိုသေချာနားထောင်ပါတဲ့။
The Arakan News သတင်းစာသည် အာရကန်ပြည်၏ ပထမဆုံး အင်္ဂလိပ်ဘာသာစကားဖြင့် ထုတ်ဝေခဲ့သည့် သတင်းစာဖြစ်သည်။ Akyab မြို့သည် ဒေသခံသတင်းစာ The Arakan News ၏ မူလထုတ်ဝေရာမြို့ဖြစ်သည်။ အက္ကီယာပ် (စစ်တွေ) တွင် Akyab Weekly News Press မှ ပုံနှိပ်ထုတ်ဝေခဲ့သည်။
၇။ ရခိုင်အသံကို ဆက်လက်ဖတ်ရှုကြမည်ဟု ကုလားခေါင်းဆောင် အချို့ပြောနေကြသည်ဟုလည်းကောင်း ထည့်သွင်းဖော်ပြထားသည်။
(မှတ်ချက်။ ။တစ်ချိန်က ဒုက္ခအတိမှ ရုန်းထွက်နိုင်ဖို့ ရိုဟင်ဂျာတွေက ဒေါ်အောင်ဆန်းစုကြည်နဲ့ သူရဲ့ ဒဒီမိုကရေစီလမ်းစဉ်ကို ယုံကြည်အားကိုးခဲ့တယ်။ အဲဒီတုန်းက ဖွဲ့ဆိုခဲ့တဲ့ တေးတစ်ပုဒ်ဖြစ်ပါတယ်။)