Marma Text Corpus

License: CC-BY-NC-SA-4.0

Steward: CLEAR Global

Task: LM

Release Date: 4/13/2026

Format: TSV

Size: 188.92 KB

Description

This dataset contains 5,675 sentences in the Marma language (ISO 639-3: rmz), a Tibeto-Burman language spoken primarily by the Marma people in Bangladesh and Myanmar. Each entry includes the original sentence and its normalized form, along with the source of the text. The data was compiled from various sources including textbooks, literature, poems, and linguist-authored sentences. The dataset is split into a training set (5,575 examples) and a test set (100 examples). It was created as part of a project by CLEAR Global funded by the Australian Government Department of Foreign Affairs and Trade (DFAT).
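
As a rough illustration of the entry layout described above, the following Python sketch reads tab-separated rows into dictionaries and counts entries per source. The column names (`sentence`, `normalized`, `source`), their order, and the absence of a header row are assumptions made for this example only; README.md is the authority on the actual layout. The sample rows are placeholders, not real Marma text.

```python
import csv
import io
from collections import Counter

# Hypothetical column names and order; check README.md for the real layout.
COLUMNS = ("sentence", "normalized", "source")


def read_rows(handle, columns=COLUMNS):
    """Yield one dict per TSV line, skipping blank lines.

    Assumes the file has no header row; skip the first row if it does.
    """
    # QUOTE_NONE keeps quotation marks in the text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or all(not f.strip() for f in fields):
            continue
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(fields)}"
            )
        yield dict(zip(columns, fields))


def source_counts(rows):
    """Count how many entries come from each source label."""
    return Counter(row["source"] for row in rows)


# Placeholder rows standing in for real data.
sample = "orig one\tnorm one\ttextbook\norig two\tnorm two\tpoem\n"
rows = list(read_rows(io.StringIO(sample)))
counts = source_counts(rows)
```

To read a real file, pass an open handle instead of the `io.StringIO` object, for example `open(path, encoding="utf-8", newline="")`, where `path` points at whichever TSV file the release provides.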

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended solely for non-commercial research and educational purposes. Commercial use requires explicit permission from the original rights holders.

Forbidden Usage

- Commercial use of this dataset without explicit permission from the original rights holders is forbidden.
- Any use that does not comply with the CC-BY-NC-SA-4.0 license terms is forbidden.
- Redistribution without proper attribution to the original sources and CLEAR Global is forbidden.

Processes

Ethical Review

This dataset was compiled with permission from the original content creators and community representatives, and CLEAR Global obtained the necessary permissions to share the data. Marma community members and language experts contributed to and validated the dataset.

Intended Use

This dataset is intended for use in developing text normalization systems, language models, and other natural language processing tools for the Marma language. It can also serve as a resource for linguistic research on the Marma language.
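
One way to use the original/normalized pairs when developing a text normalization system is to score a candidate normalizer by exact match against the reference forms, for instance on the held-out test examples. The sketch below is a minimal, assumed evaluation setup, not a procedure prescribed by the dataset authors. The function names are hypothetical, `collapse_whitespace` is only a trivial stand-in baseline, and the pairs are placeholders rather than real data.

```python
def exact_match_rate(pairs, normalize):
    """Fraction of (original, reference) pairs the normalizer reproduces exactly."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to score")
    hits = sum(
        1 for original, reference in pairs if normalize(original) == reference
    )
    return hits / len(pairs)


def collapse_whitespace(text):
    """Stand-in baseline: trim the ends and collapse runs of whitespace."""
    return " ".join(text.split())


# Placeholder (original, normalized) pairs, not real Marma text.
pairs = [("a  b", "a b"), (" c", "c"), ("d !", "d!")]
score = exact_match_rate(pairs, collapse_whitespace)
```

Exact match is a strict metric; a character-level edit distance may be more informative when a normalizer is close to the reference but not identical.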

Metadata

The dataset was compiled from multiple sources including PCJSS documents, NCTB textbooks, Marma literature and poetry, and sentences authored by a Marma linguist. The text is written in Burmese script. The normalization process standardizes punctuation and formatting while preserving the linguistic content. Please check README.md for more information.