Synthetic Text Corpus for African Language ASR

License:

CC-BY-NC-4.0

Steward:

CLEAR Global

Task: NLP

Release Date: 4/13/2026

Format: TSV

Size: 746.63 KB


Description

This dataset contains 13,488 synthetic sentences across 10 African languages (Bambara, Chichewa, Hausa, Kanuri, Luo, Nande, Somali, Twi, Wolof, Yoruba) generated using large language models (GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet). Each sentence has been evaluated by human linguists on readability and naturalness (1-7 scale), translation adequacy and accuracy (1-7 scale), grammatical correctness, word validity, and presence of notable errors. Corrected versions are provided where applicable. The dataset was created by Dimagi to support ASR, NLP research, and evaluation for low-resource African languages. See: DeRenzi et al. (2025), "Synthetic Voice Data for Automatic Speech Recognition in African Languages", arXiv:2507.17578.
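Because each sentence carries human ratings, a common first step is to keep only the rows above a chosen quality threshold. The following is a minimal sketch of that step, not the authors' tooling. The column names (`language`, `sentence`, `readability`) and the sample rows are hypothetical placeholders; the actual TSV header is documented in README.md and may differ.

```python
import csv
import io

# Hypothetical column names (assumption); check README.md for the real header.
TEXT_COL = "sentence"
READ_COL = "readability"


def load_rows(tsv_file):
    """Read a tab-separated file object into a list of dicts keyed by header."""
    return list(csv.DictReader(tsv_file, delimiter="\t"))


def filter_by_score(rows, column, minimum):
    """Keep rows whose rating in `column` is numeric and >= minimum (1-7 scale)."""
    kept = []
    for row in rows:
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric rating
        if score >= minimum:
            kept.append(row)
    return kept


# Placeholder rows for illustration only; they are not taken from the dataset.
SAMPLE_TSV = (
    "language\tsentence\treadability\n"
    "Hausa\tplaceholder sentence A\t6\n"
    "Wolof\tplaceholder sentence B\t3\n"
    "Yoruba\tplaceholder sentence C\t\n"
)

sample_rows = load_rows(io.StringIO(SAMPLE_TSV))
high_quality = filter_by_score(sample_rows, READ_COL, minimum=5)
print([row[TEXT_COL] for row in high_quality])
```

To run this against the downloaded file, pass a file object opened with `encoding="utf-8"` and `newline=""` to `load_rows` in place of the in-memory sample.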

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Non-commercial use only. Attribution required. Additionally, OpenAI's Terms of Use and Anthropic's Terms of Service apply to the LLM-generated text content.

Forbidden Usage

No commercial use. No illegal, harmful, violent, or infringing use.

Processes

Ethical Review

Data is synthetically generated by LLMs, not collected from human subjects. Human linguists reviewed and rated the generated text.

Intended Use

Synthetic audio generation via text-to-speech (TTS) for training automatic speech recognition (ASR) models, plus research and evaluation in ASR, natural language processing (NLP), machine translation, and related fields for low-resource African languages.
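For the TTS use case, one plausible preparation step is to group sentences by language and prefer the linguist-corrected text where one exists. The sketch below illustrates that idea only. The column names (`language`, `sentence`, `corrected_sentence`) and the sample rows are assumptions for illustration, not the dataset's documented schema, and no particular TTS system is implied.

```python
from collections import defaultdict


def tts_inputs_by_language(
    rows,
    lang_col="language",                 # hypothetical column name
    text_col="sentence",                 # hypothetical column name
    corrected_col="corrected_sentence",  # hypothetical column name
):
    """Group sentence text by language, preferring the corrected version."""
    grouped = defaultdict(list)
    for row in rows:
        corrected = (row.get(corrected_col) or "").strip()
        original = (row.get(text_col) or "").strip()
        text = corrected or original  # fall back to the original if no correction
        if text:
            grouped[row[lang_col]].append(text)
    return dict(grouped)


# Placeholder rows for illustration only; they are not taken from the dataset.
example_rows = [
    {"language": "Somali", "sentence": "original A", "corrected_sentence": "corrected A"},
    {"language": "Somali", "sentence": "original B", "corrected_sentence": ""},
    {"language": "Twi", "sentence": "original C", "corrected_sentence": None},
]

tts_batches = tts_inputs_by_language(example_rows)
for language, sentences in tts_batches.items():
    print(language, len(sentences))
```

Each per-language list can then be passed to whichever TTS system is in use.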

Metadata

This corpus was created as part of a project to develop ASR models for low-resource African languages using synthetic data. Sentences were generated by large language models (GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet) and then evaluated by human linguists who rated readability, grammatical correctness, word validity, and translation adequacy on standardized scales. The text was subsequently used to generate synthetic audio via text-to-speech systems, and the resulting synthetic voice data was used to fine-tune ASR models. The work was led by Dimagi. See README.md and the published paper for more information.

Citation information:

@inproceedings{DeRenzi2025,
  title={Synthetic Voice Data for Automatic Speech Recognition in African Languages},
  author={DeRenzi, Brian and Dixon, Anna and Farhi, Mohamed Aymane and Resch, Christian},
  booktitle={Proceedings of the 1st Workshop on Advancing NLP for Low-Resource Languages associated with RANLP},
  pages={152--186},
  year={2025},
  doi={10.48550/arXiv.2507.17578}
}

The dataset is also hosted on Hugging Face.