Marma Text Corpus
License:
CC-BY-NC-SA-4.0
Steward:
CLEAR Global
Task: LM
Release Date: 4/13/2026
Format: TSV
Size: 188.92 KB
Description
This dataset contains 5,675 sentences in the Marma language (ISO 639-3: rmz), a Tibeto-Burman language spoken primarily by the Marma people in Bangladesh and Myanmar. Each entry includes the original sentence and its normalized form, along with the source of the text. The data was compiled from various sources including textbooks, literature, poems, and linguist-authored sentences. The dataset is split into a training set (5,575 examples) and a test set (100 examples). It was created as part of a project by CLEAR Global funded by the Australian Government Department of Foreign Affairs and Trade (DFAT).
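As a rough illustration of how the TSV files might be read, the sketch below parses a tab-separated stream into one dictionary per sentence. The column names (sentence, normalized, source) and the helper names read_tsv and load_split are assumptions made for this example and are not taken from the dataset; the actual header should be checked against README.md. The sample row uses placeholder text, not real Marma data.

```python
import csv
import io

# Hypothetical column names, assumed for illustration only.
# Check README.md for the header the released files actually use.
ASSUMED_COLUMNS = ("sentence", "normalized", "source")


def read_tsv(handle):
    """Parse a tab-separated stream with a header row into a list of dicts."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def load_split(path):
    """Read one split (for example the training or test file) from disk."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_tsv(handle)


# Tiny in-memory example with placeholder text instead of real data.
sample = "sentence\tnormalized\tsource\nabc ,\tabc,\ttextbook\n"
rows = read_tsv(io.StringIO(sample))
print(len(rows), rows[0]["normalized"], rows[0]["source"])
```

Reading with an explicit UTF-8 encoding matters here because the text is in Burmese script, and disabling quote handling avoids silently merging rows when a sentence contains a stray quotation mark.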
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended solely for non-commercial research and educational purposes. Commercial use requires explicit permission from the original rights holders.
Forbidden Usage
- Commercial use of this dataset without explicit permission from the original rights holders is forbidden.
- Any use that does not comply with the CC-BY-NC-SA-4.0 license terms is forbidden.
- Redistribution without proper attribution to the original sources and CLEAR Global is forbidden.
Processes
Ethical Review
This dataset was compiled with permission from the original content creators and community representatives. CLEAR Global obtained the necessary permissions to share this data. Marma community members and language experts contributed to and validated the dataset.
Intended Use
This dataset is intended for use in developing text normalization systems, language models, and other natural language processing tools for the Marma language. It can also serve as a resource for linguistic research on the Marma language.
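One way a text normalization system could be scored against the held-out test examples is exact-match accuracy: the share of sentences for which the system output equals the reference normalized form. The sketch below is only an illustration under that assumption. The function names exact_match_accuracy and identity_baseline are hypothetical, the dataset does not prescribe any evaluation metric, and the pairs are placeholder strings rather than real Marma sentences.

```python
def exact_match_accuracy(pairs, normalize):
    """Fraction of (original, expected) pairs where normalize(original) == expected."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for original, expected in pairs if normalize(original) == expected)
    return hits / len(pairs)


def identity_baseline(text):
    """Trivial baseline that returns the input unchanged."""
    return text


# Placeholder (original, normalized) pairs; real pairs would come from the test split.
pairs = [("a ,b", "a, b"), ("c.", "c.")]
score = exact_match_accuracy(pairs, identity_baseline)
print(score)
```

An identity baseline is a useful reference point: if many sentences are already in normalized form, a system has to beat the score obtained by doing nothing at all.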
Metadata
The dataset was compiled from multiple sources including PCJSS documents, NCTB textbooks, Marma literature and poetry, and sentences authored by a Marma linguist. The text is written in Burmese script. The normalization process standardizes punctuation and formatting while preserving the linguistic content. Please check README.md for more information.