Task: ASR
Release Date: 5/20/2026
Format: WAV
Size: 29.93 GB
81.01 hours of manually curated speech-text pairs by native speakers in the Khmer language about Cambodian cultural topics. On average, each recording is 8 seconds. Speaker metadata (gender, age group, and origin city) is provided.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
Please attribute Digital Divide Data if you use this dataset in any way.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset.
Intended Use
Khmer ASR Cultural Dataset is intended primarily for training automatic speech recognition (ASR) models. It can also be used to train models on Khmer text-to-speech (TTS), language modeling, topic modeling, and next sentence prediction.
Language: Khmer (khm).
Source(s): Native speakers from Cambodia (5 female, 7 male). The utterances were manually generated based on the topics and subtopics listed in the metadata.
Domain(s): Cultural domain, with a total of 4 topics and 172 subtopics.
Size: 50GB
Data Instances
WAV file names are formatted as: {speaker_id}khm{sentence_id}.wav.
The first row of our metadata.csv:
| Topic | Subtopic | Speaker ID | Paragraph ID | Sentence ID | Sentences |
|---|---|---|---|---|---|
| Recipes | Street food dishes | f-adt1-0001 | 1 | recipes_01_0001_0001 | មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។ |
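As a rough illustration of how the metadata and the audio files relate, the sketch below reads metadata.csv with Python's csv module and builds the WAV file name for each row from the naming template above. It assumes the CSV header uses the column names shown in the table, and it applies the template literally. The helper names `wav_name` and `index_metadata` are hypothetical, the transcript in the sample is a placeholder, and any separators around `khm` in the real file names should be checked against the downloaded files.

```python
import csv
import io


def wav_name(speaker_id: str, sentence_id: str) -> str:
    """Build a WAV file name from the documented template.

    Assumption: the template {speaker_id}khm{sentence_id}.wav is applied
    literally; verify it against the actual files before relying on it.
    """
    return f"{speaker_id}khm{sentence_id}.wav"


def index_metadata(csv_file) -> dict:
    """Map each expected WAV file name to its transcript."""
    reader = csv.DictReader(csv_file)
    return {
        wav_name(row["Speaker ID"], row["Sentence ID"]): row["Sentences"]
        for row in reader
    }


# Tiny in-memory stand-in for metadata.csv (placeholder transcript).
sample = io.StringIO(
    "Topic,Subtopic,Speaker ID,Paragraph ID,Sentence ID,Sentences\n"
    "Recipes,Street food dishes,f-adt1-0001,1,recipes_01_0001_0001,example transcript\n"
)
index = index_metadata(sample)
```

With the real file, the same function could be called as `index_metadata(open("metadata.csv", encoding="utf-8", newline=""))`, where the local path is an assumption about where the download was unpacked.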
Khmer ASR Cultural Dataset is also available on HuggingFace.
Off-the-shelf state-of-the-art multilingual pre-trained automatic speech recognition models (e.g., OpenAI's Whisper) cannot transcribe Khmer well. Even with further fine-tuning, the error rate for Khmer ASR (lower is better; 0% means no errors) is far from usable (Lovenia, 2025). See the khm column in Figure 3 below.
[Figure 3: ASR error rates from Lovenia (2025), including the khm column]
Training a good automatic speech recognition (ASR) model for Khmer requires a large amount of Khmer speech-text pairs. However, before Khmer ASR Cultural Dataset became available, there was only one Khmer speech-text dataset: OpenSLR 42, with 3.97 hours of speech-text pairs (male speakers only).
Our preliminary experiment shows that adding just 650 speech-text pairs from the Digital Divide Data (DDD) dataset to the training data decreases the Whisper models' character error rate (CER) by around 0.46%-0.74% compared to training on OpenSLR 42 alone. Whisper Large V2's CER on Khmer is now down to 8.11%. With more speech-text pairs collected by DDD, ASR models should be able to transcribe Khmer audio with even fewer errors.
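For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the model's transcript and the reference transcript, divided by the length of the reference. The sketch below is a minimal pure-Python version of that definition. The function names are hypothetical, and this is not necessarily the scoring script behind the numbers above, which may normalize text before comparing.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Note: Python strings are compared per Unicode code point. Khmer uses
    combining signs, so tools that count differently may report other values.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("abcd", "abxd")` gives 0.25, since one substitution is needed across four reference characters.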
Khmer ASR Cultural Dataset's license is Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). Please attribute Digital Divide Data if you use this dataset in any way.