Synthetic Text Corpus for African Language ASR

Description

This dataset contains 13,488 synthetic sentences across 10 African languages (Bambara, Chichewa, Hausa, Kanuri, Luo, Nande, Somali, Twi, Wolof, Yoruba) generated using large language models (GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet). Each sentence has been evaluated by human linguists on readability and naturalness (1-7 scale), translation adequacy and accuracy (1-7 scale), grammatical correctness, word validity, and presence of notable errors. Corrected versions are provided where applicable. The dataset was created by Dimagi to support ASR, NLP research, and evaluation for low-resource African languages. See: DeRenzi et al. (2025), "Synthetic Voice Data for Automatic Speech Recognition in African Languages", arXiv:2507.17578.
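The evaluation fields described above lend themselves to simple quality filtering before the text is passed to downstream work. The following is a minimal sketch only: the record layout, field names (`readability`, `adequacy`, `corrected_text`, and so on), the placeholder sentences, and the threshold of 5 are all hypothetical and are not taken from the released files. It also assumes that higher scores on the 1-7 scales are better. Check README.md for the actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RatedSentence:
    """Hypothetical record layout; the real column names may differ."""
    language: str
    model: str
    text: str
    readability: int          # assumed 1-7, higher is better
    adequacy: int             # assumed 1-7, higher is better
    grammatical: bool
    corrected_text: Optional[str] = None


def select_usable(rows: List[RatedSentence], min_score: int = 5) -> List[str]:
    """Keep sentences whose ratings meet the threshold.

    When a linguist-corrected version exists, it is preferred over the
    original model output.
    """
    selected = []
    for row in rows:
        for score in (row.readability, row.adequacy):
            if not 1 <= score <= 7:
                raise ValueError(f"score outside the 1-7 scale: {score}")
        if row.readability >= min_score and row.adequacy >= min_score:
            selected.append(row.corrected_text or row.text)
    return selected


# Placeholder rows for illustration; these are not real dataset entries.
SAMPLE_ROWS = [
    RatedSentence("Hausa", "GPT-4o", "<sentence 1>", 6, 7, True),
    RatedSentence("Wolof", "Claude 3.5 Sonnet", "<sentence 2>", 3, 4, False,
                  "<sentence 2, corrected>"),
    RatedSentence("Yoruba", "GPT-4.5", "<sentence 3>", 5, 5, False,
                  "<sentence 3, corrected>"),
]

if __name__ == "__main__":
    for sentence in select_usable(SAMPLE_ROWS):
        print(sentence)
```

Preferring the corrected version where one exists is a design choice for this sketch; a study of raw model output quality would instead keep the original text.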

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Non-commercial use only. Attribution required. Additionally, OpenAI's Terms of Use and Anthropic's Terms of Service apply to the LLM-generated text content.

This corpus was created as part of a project to develop ASR models for low-resource African languages using synthetic data. Sentences were generated by large language models (GPT-4o, GPT-4.5, Claude 3.5 Sonnet, Claude 3.7 Sonnet) and then evaluated by human linguists, who rated readability, grammatical correctness, word validity, and translation adequacy on standardized scales. The text was subsequently used to generate synthetic audio via text-to-speech systems, and the resulting synthetic voice data was used to fine-tune ASR models. The work was led by Dimagi. See README.md and the published paper for more information.

Citation information:

@inproceedings{DeRenzi2025,
  title={Synthetic Voice Data for Automatic Speech Recognition in African Languages},
  author={DeRenzi, Brian and Dixon, Anna and Farhi, Mohamed Aymane and Resch, Christian},
  booktitle={Proceedings of the 1st Workshop on Advancing NLP for Low-Resource Languages associated with RANLP},
  pages={152--186},
  year={2025},
  doi={10.48550/arXiv.2507.17578}
}

The dataset is also hosted on Hugging Face.
