License:
CC-BY-NC-SA-4.0
Steward:
Bahasa Bahana
Task: TTS
Release Date: 4/30/2026
Format: WEBM, TSV
Size: 273.32 MB
This dataset supports an underrepresented language in technology by contributing to the development of speech recognition for the Manggarai language. Manggarai is part of the Austronesian language family and is primarily spoken on the island of Flores, Indonesia. As one of Indonesia’s local languages, it occasionally features code-switching and code-mixing with Indonesian and English, although such instances are relatively limited. This dataset provides audio recordings paired with text, representing a variety of topics and everyday expressions.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
This dataset is intended for NLP (AI training), language teaching-learning, and research on under-represented languages. However, if you wish to use it for other purposes, please contact the administrator by submitting an access request with a clear statement of your intended use. Users must comply with the dataset license terms and provide appropriate attribution where required.
Forbidden Usage
You must not attempt to identify or re-identify any individual speaker. You must not use this dataset to clone voices or create systems that imitate specific speakers. You must not use this dataset for malicious, deceptive, or harmful purposes. You must not use this dataset to generate misleading or fraudulent audio content. Redistributing the dataset to any individual or organization is not allowed. You must not use this dataset for commercial purposes, in accordance with the license. The dataset may be accessed only via Mozilla Data Collective and in compliance with its terms and conditions. Any use that is intended to wage war or that violates privacy rights, human rights (animate and inanimate), or applicable laws is strictly prohibited. This dataset may not be used by companies, individuals, or affiliates with any known human rights violations or war crimes. Unattributed use of the dataset is prohibited.
Ethical Review
This dataset was created by a native speaker who is also considered a linguist. First, short essays on various topics were written in the Manggarai language, with some code-mixing in Indonesian and English. The essays were then read aloud and recorded through the hosting platform https://sabre-2.onrender.com/. Finally, the audio recordings were compiled into a comprehensive dataset.
Intended Use
This dataset is intended for NLP development (AI training), language teaching and learning, and research on under-represented languages. It supports the documentation, preservation, and conservation of a local language of Indonesia, particularly through technology.
This dataset represents the Manggarai language, which is primarily spoken in the western part of Flores Island, East Nusa Tenggara (NTT), Indonesia, with minor instances of code-switching and code-mixing with Indonesian and English.
This dataset was organically and naturally created by a native speaker of the Manggarai language. An online Indonesian–Manggarai dictionary was used to support accurate vocabulary usage (source: https://anyflip.com/rdptn/xhwq ).
This dataset encompasses a range of general domains, including local culture, family, education, social activities, and technology.
How to cite this dataset: Bahasa Bahana. (2026). BAHANA-Manggarai TTS [dataset]. Mozilla Data Collective. URL (Dataset Link)
For a more in-depth discussion of the linguistics of this dataset, we recommend contacting the community researcher, Mauritio Pamungkas, by email at pamungkasthyo@gmail.com.
5 hours
Audio file name, text
“Manga ngasang hitu Badan Pengawas Obat dan Makanan (BPOM), isé situ ga ata de pemerénta.”
“Manga ata paké media sosial, ného paké tiktok ko instagram kudut pande awar ata weli.”
“Ngoéng ko toé ngoéng néténg-néténg ata paka bacang surak kabar online hitu ai nai koén kawe surak kabar sot toé online lawang mosé ho'o ga.”
“Sanggér surak kabar situ ga tunti paké tombo de ata lau mai Inggris, poli hitu manga kolé surak kabar The Japan Times hitu ga tunti paké tombo de ata lau mai Jepang, poli hitu manga kolé surak kabar Choson Ilbo hotu tunti paké tombo de ata lau mai Korea.”
“Kudut ngancé kin tombo ko cumang tau agu haé olét de ité sot tadang tau hitu ga tombo kaut oné internet agu media sosial ata manga oné HP.”
Latin alphabet (A–Z), Arabic numerals (0–9)
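Because the dataset is distributed as WEBM audio with TSV metadata, and each record pairs an audio file name with its text, a loader might look roughly like the sketch below. It is only an illustration under assumptions: the column order (file name first, text second), the presence of a header row, the field names `audio_file` and `text`, and the file name `sample_0001.webm` are all hypothetical and should be checked against the actual release.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Utterance(NamedTuple):
    audio_file: str  # hypothetical name for the audio file name column
    text: str        # hypothetical name for the transcript column


def read_metadata(lines: Iterable[str], has_header: bool = True) -> List[Utterance]:
    """Parse tab-separated rows of (audio file name, text)."""
    # QUOTE_NONE leaves quotation marks and apostrophes in the text untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    if has_header:
        rows = rows[1:]  # assumption: the first row holds column names
    return [Utterance(row[0], row[1]) for row in rows]


# Tiny in-memory stand-in for the TSV file; the file name is made up.
demo = io.StringIO(
    "audio_file\ttext\n"
    "sample_0001.webm\tManga ata paké media sosial, ného paké tiktok ko "
    "instagram kudut pande awar ata weli.\n"
)
utterances = read_metadata(demo)
print(utterances[0].audio_file, len(utterances))
```

With a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, so that characters such as "é" in the transcripts are read correctly.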
Indonesian-Manggarai online dictionary: https://anyflip.com/rdptn/xhwq
Instagram: https://www.instagram.com/bahasabahana