License:
CC-BY-SA-4.0
Steward:
Bahasa Bahana
Task: TTS
Release Date: 4/30/2026
Format: WEBM, TSV
Size: 316.13 MB
BAHANA-Betawi TTS is a Betawi language dataset that represents the language dynamics of communities around Indonesia's urban administrative centers. It consists of Betawi varieties found in West Java Province and Betawi dialects spoken around the center of Jakarta Province, and it reflects strong contact with Indonesian and English. The dataset covers a wide range of topics, from everyday activities to modernity. It can be used for AI (NLP) training, language teaching and learning, and linguistic research on under-represented languages.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
This dataset is intended for NLP (AI training), teaching, and research. If you wish to use it for other purposes, please contact the administrator by submitting an access request with a clear statement of your intended use. Use of the dataset for commercial derivative products requires contacting the admin, and Bahasa Bahana will receive compensation, the details of which will be discussed further. This dataset is open to the public, is released for open use, and is not exclusively granted to any party. It may be used for research, education, and commercial applications by companies earning no more than 10 million dollars per year. For commercial use, explicit permission must be obtained from the data owner through an access request. Users must comply with the dataset license terms and provide appropriate attribution where required.
Forbidden Usage
You must not attempt to identify or re-identify any individual speaker. You must not use this dataset to clone voices or create systems that imitate specific speakers. You must not use this dataset for malicious, deceptive, or harmful purposes. You must not use this dataset to generate misleading or fraudulent audio content. Redistributing the dataset to any individual or organization is not allowed. The dataset may be accessed only through Mozilla Data Collective, in compliance with its terms and conditions. Any use that is intended to wage war or that violates privacy rights, human rights (animate and inanimate), or applicable laws is strictly prohibited. This dataset may not be used by companies, individuals, or affiliates with any known human rights violations or war crimes. Unattributed use of the dataset is prohibited.
Ethical Review
This dataset was created by native speakers, who are also considered linguists. All participants were informed and gave their consent to the creation of this dataset. First, short essays on various topics were written in the Betawi language with code-mixing in Indonesian and English. The essays were then read aloud and recorded through the hosting platform https://sabre-2.onrender.com/. Finally, the audio recordings were compiled into a comprehensive dataset.
Intended Use
This dataset is intended for NLP development (AI training), language teaching and learning, and research on under-represented languages. It supports the documentation, preservation, and conservation of Indonesia's local languages, particularly through technology.
This dataset uses the Betawi language of West Java Province with urban Jakarta Indonesian and English code-mixing and code-switching.
This dataset was created organically by a team of dataset creators, who are considered native speakers and linguists.
This dataset comprises a wide range of topics within the contexts of local culture, family, local education, social activities, technology, etc.
How to cite this dataset: Bahasa Bahana. (2026). BAHANA-Betawi TTS [dataset]. Mozilla Data Collective. URL (Dataset Link)
For a more in-depth discussion of the linguistics of this dataset, we recommend contacting the community researcher, Riska Legistari Febri, by email at riskalegistari25@gmail.com.
5 hours
audio_filename, sentence
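The two fields listed above suggest a simple tab-separated transcript file that pairs each recording with its text. The following is a minimal sketch of reading such a file in Python, assuming the TSV has a header row with exactly these two column names. The clip name `clip_0001.webm` and the in-memory sample are hypothetical, and the sentence is a shortened form of one of the samples in this card.

```python
import csv
import io


def read_metadata(tsv_file):
    """Return a list of (audio_filename, sentence) pairs from a TSV file object.

    Assumes a header row naming the columns 'audio_filename' and 'sentence'.
    """
    # QUOTE_NONE keeps any quotation marks inside sentences untouched.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["audio_filename"], row["sentence"]) for row in reader]


# Hypothetical one-row sample standing in for the real transcript file.
sample = io.StringIO(
    "audio_filename\tsentence\n"
    "clip_0001.webm\tKalo jualannye ga laku, ya paling di diskon.\n"
)
pairs = read_metadata(sample)
```

With a real file, the same function could be called on a file opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded TSV is stored.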
“Di lingkungan tempat aye merantau saat ini jarang liat yang bace koran, tapi masih ade itu juge gak banyak.”
“Kalo di Jawa Barat, ade web khusus yang aye tau, yaitu Tribun Jabar, PR (Pikiran Rakyat), dan Radar Bandung, soalnye suka muncul notifikasi di HP kaye Detik.com ame CNN setiap ade berite terkini langsung muncul.”
“Zaman sekarang yang modern ini, lebih ke digitalisasi serbe online serte meminimalisir sampeh kertas juge dan online lebih praktis tinggal scrol-scrol aje deh.”
“Ane rasa, niat baik dan kebahagiaan orang yang dikasih hadiah jauh lebih penting daripada mahal atau banyaknya barang yang dikasih.”
“Kalo jualannye ga laku, ya paling di diskon, kaye beli 1 gratis 1 atau diskon sampe 50-70% terutame kadaluwarsanye bentar lagi.”
“Kalo suatu negare punye ketahanan pangan sendiri juge, ketike terjadi krisis global kaye pandemi covid-19 kemaren, bencane, perang, jadi kite ga kebingungan.”
Latin alphabet (A–Z), Arabic numerals (0–9)
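Because the transcripts are described as using only the Latin alphabet and Arabic numerals, a quick character check can help spot stray symbols before TTS training. This is only a sketch: treating ASCII punctuation and spaces as allowed is an assumption not stated in this card, and `unexpected_chars` is a hypothetical helper name.

```python
import string

# Letters and digits come from the card; punctuation and space are an assumption.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")


def unexpected_chars(sentence):
    """Return the sorted list of distinct characters outside the allowed set."""
    return sorted({ch for ch in sentence if ch not in ALLOWED})


clean = unexpected_chars("Kalo di Jawa Barat, ade web khusus yang aye tau")
flagged = unexpected_chars("samp\u00e9 50%")  # contains an accented letter
```

Sentences that return a non-empty list could be reviewed by hand rather than dropped automatically, since an unexpected character may simply be legitimate punctuation outside the assumed set.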
Radio Betawi: https://www.bensradio.com/kamus-bahasa-betawi/
Instagram: https://www.instagram.com/bahasabahana