Khmer ASR Cultural Dataset (Version 3 - Part 7)

81.01 hours of speech-text pairs in Khmer on Cambodian cultural topics, manually curated by native speakers. Each recording is 8 seconds long on average. Speaker metadata (gender, age group, and origin city) is provided.

Language: Khmer (khm).
Source(s): Native speakers from Cambodia (5 female, 7 male). The utterances were manually generated based on the topics and subtopics listed in the metadata.
Domain(s): Cultural domain, with a total of 4 topics and 172 subtopics.
Size: 50 GB of data instances.
WAV file names are formatted as: {speaker_id}khm{sentence_id}.wav.

Sample

The first row of our metadata.csv:

Topic	Subtopic	Speaker ID	Paragraph ID	Sentence ID	Sentences
Recipes	Street food dishes	f-adt1-0001	1	recipes_01_0001_0001	មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។
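
As a rough illustration of how the metadata maps to audio files, the sketch below parses the sample row above and builds a WAV file name from the stated pattern. The tab delimiter mirrors how the row is displayed here and is an assumption about the actual metadata.csv; `read_rows`, `wav_name`, and the `sep` parameter are hypothetical helpers, not part of the dataset.

```python
import csv
import io

# Header and first row copied from the sample above. The tab delimiter is an
# assumption; the real metadata.csv may use a different one.
SAMPLE = (
    "Topic\tSubtopic\tSpeaker ID\tParagraph ID\tSentence ID\tSentences\n"
    "Recipes\tStreet food dishes\tf-adt1-0001\t1\trecipes_01_0001_0001\t"
    "មុខម្ហូបតាមដងផ្លូវ គឺជាមុខម្ហូបមួយមានភាពសម្បូរបែប និងមានភាពងាយស្រួល "
    "ដែលគេពេញនិយមក្នុងការបរិភោគ ថែមទាំងមានតម្លៃសមរម្យ។\n"
)


def read_rows(text, delimiter="\t"):
    """Parse metadata text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def wav_name(row, sep=""):
    """Build a WAV file name from a metadata row.

    With sep="" this follows the pattern exactly as written in this document.
    sep is a hypothetical knob in case the real files use a separator.
    """
    return f"{row['Speaker ID']}{sep}khm{sep}{row['Sentence ID']}.wav"


rows = read_rows(SAMPLE)
first = rows[0]
print(first["Topic"], "|", first["Subtopic"], "|", wav_name(first))
```

Check the generated names against the actual files in the download before relying on them.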

Khmer ASR Cultural Dataset is also available on HuggingFace.

Use cases

Automatic speech recognition (ASR)

Off-the-shelf, state-of-the-art multilingual pre-trained ASR models (e.g., OpenAI's Whisper) cannot transcribe Khmer well. Even with further fine-tuning, the error rate (lower is better; 0% means a perfect transcription) for Khmer ASR is far from usable (Lovenia, 2025). See the khm column in Figure 3 below.

[Figure 3: ASR error rates by language, including the khm column (Lovenia, 2025)]

A good ASR model for Khmer requires a large amount of Khmer speech-text pairs. However, before Khmer ASR Cultural Dataset became available, there was only one Khmer speech-text dataset: OpenSLR 42, with 3.97 hours of speech-text pairs (male speakers only).

Our preliminary experiment shows that adding just 650 speech-text pairs from Digital Divide Data's (DDD) dataset to the training data lowers the Whisper models' character error rate (CER) by around 0.46%-0.74% compared with training on OpenSLR 42 alone. With this addition, Whisper Large V2 reaches 8.11% CER on Khmer. With more speech-text pairs collected by DDD, we expect ASR models to transcribe Khmer audio with even fewer errors.
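
For readers unfamiliar with the metric, here is a minimal sketch of character error rate: the character-level edit distance between the reference and the model output, divided by the reference length. This is the standard textbook definition, not necessarily the exact evaluation script behind the numbers above; `edit_distance` and `cer` are illustrative names, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty output
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four gives a CER of 0.25 (25%).
print(cer("abcd", "abxd"))
```

The sketch compares Unicode code points. Khmer uses combining marks, so an evaluation toolkit that normalizes or segments text differently may report slightly different values.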

Other potential use cases

Khmer ASR Cultural Dataset can also be used to train models on Khmer text-to-speech (TTS), language modeling, topic modeling, and next sentence prediction.

Attribution

Khmer ASR Cultural Dataset's license is Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0). Please attribute Digital Divide Data if you use this dataset in any way.
