BAHANA-Manggarai TTS

Description

This dataset supports an underrepresented language in technology by contributing to the development of speech recognition for the Manggarai language. Manggarai is part of the Austronesian language family and is primarily spoken on the island of Flores, Indonesia. As one of Indonesia’s local languages, it occasionally features code-switching and code-mixing with Indonesian and English, although such instances are relatively limited. This dataset provides audio recordings paired with text, representing a variety of topics and everyday expressions.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for NLP (AI training), language teaching and learning, and research on underrepresented languages. To use it for any other purpose, please contact the administrator by submitting an access request that clearly states your intended use. Users must comply with the dataset license terms and provide appropriate attribution where required.

Language:

This dataset represents the Manggarai language, which is primarily spoken in the western part of Flores Island, East Nusa Tenggara (NTT), Indonesia, with minor instances of code-switching and code-mixing with Indonesian and English.

Source:

This dataset was created by a native speaker of the Manggarai language and reflects natural usage. An online Indonesian–Manggarai dictionary was used to support accurate vocabulary usage (source: https://anyflip.com/rdptn/xhwq).

Domains:

This dataset encompasses a range of general domains, including local culture, family, education, social activities, and technology.

Additional Information:

How to cite this dataset: Bahasa Bahana. (2026). BAHANA-Manggarai TTS [dataset]. Mozilla Data Collective. URL (Dataset Link)

For a more in-depth discussion of the linguistics of this dataset, we recommend contacting the community researcher, Mauritio Pamungkas, by email at pamungkasthyo@gmail.com.

Size:

5 hours

Structure:

Audio file name, text
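
The two-field structure above can be read with a few lines of Python. This is only a minimal sketch: it assumes the pairs are stored as tab-separated rows with no header row, which this card does not state, and the `Utterance` type, the `read_utterances` helper, and the `clip_0001.wav` style file names are hypothetical names introduced for illustration.

```python
import csv
import io
from typing import NamedTuple


class Utterance(NamedTuple):
    audio_file: str  # audio file name, first field of each row
    text: str        # transcript paired with that audio file


def read_utterances(handle, delimiter="\t"):
    """Read (audio file name, text) rows from an open text file.

    Assumption: one pair per line, tab-separated, no header row.
    """
    # QUOTE_NONE keeps apostrophes and quotation marks in the text untouched.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rows.append(Utterance(row[0].strip(), row[1].strip()))
    return rows


# Hypothetical listing; the file names are invented, the text is from the samples.
listing = io.StringIO(
    "clip_0001.wav\tManga ata paké media sosial, ného paké tiktok ko "
    "instagram kudut pande awar ata weli.\n"
    "clip_0002.wav\tKudut ngancé kin tombo ko cumang tau agu haé olét de ité "
    "sot tadang tau hitu ga tombo kaut oné internet agu media sosial ata "
    "manga oné HP.\n"
)

for utterance in read_utterances(listing):
    print(utterance.audio_file, "->", utterance.text[:40])
```

If the actual files use a different delimiter or include a header row, pass another `delimiter` value or skip the first row before using the result.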

Sample:

“Manga ngasang hitu Badan Pengawas Obat dan Makanan (BPOM), isé situ ga ata de pemerénta.”

“Manga ata paké media sosial, ného paké tiktok ko instagram kudut pande awar ata weli.”

“Ngoéng ko toé ngoéng néténg-néténg ata paka bacang surak kabar online hitu ai nai koén kawe surak kabar sot toé online lawang mosé ho'o ga.”

“Sanggér surak kabar situ ga tunti paké tombo de ata lau mai Inggris, poli hitu manga kolé surak kabar The Japan Times hitu ga tunti paké tombo de ata lau mai Jepang, poli hitu manga kolé surak kabar Choson Ilbo hotu tunti paké tombo de ata lau mai Korea.”

“Kudut ngancé kin tombo ko cumang tau agu haé olét de ité sot tadang tau hitu ga tombo kaut oné internet agu media sosial ata manga oné HP.”

Writing System:

Latin alphabet (A–Z, including accented letters such as é, as seen in the samples above), Arabic numerals (0–9)

Useful Links:

Indonesian-Manggarai online dictionary: https://anyflip.com/rdptn/xhwq

Instagram: https://www.instagram.com/bahasabahana

LinkedIn: https://www.linkedin.com/company/bahasa-bahana