License: NOODL-1.0
Steward: Institute of African Digital Humanities
Dataset ID: cmp1ma8ty007eo607klisho1a
Task: TTS
Release Date: 5/11/2026
Format: MP3, TSV
Size: 297.28 MB
This dataset comprises audio recordings of Efik speech aligned with text transcriptions. It is organised into 10 folders, each containing audio files and a tab-separated mapping file that pairs every audio file with its transcription. The clips are short, typically 1 to 30 seconds, and are suitable for training and evaluating Text-to-Speech (TTS) systems. The text comes from a variety of written Efik sources, including encyclopaedic and informational texts as well as everyday conversational phrases, and was segmented into short utterances suitable for read speech and TTS modelling.
Licensing
Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)
https://licensingafricandatasets.com/nwulite-obodo-license
Restrictions/Special Constraints
- For research and scientific use only
- You agree not to re-host or redistribute this dataset
Forbidden Usage
You agree not to use the data for:
- Generative AI
- Voice cloning or speaker imitation
- Reproduction, duplication, modification, or redistribution
- Commercial use without explicit permission
Intended Use
The dataset aims to support:
- Language technology development for an important language of southeastern Nigeria
- Development of speech technologies for under-served African language communities
- Educational applications in multilingual contexts
- Research in low-resource and African language speech synthesis
- Preservation and documentation of the Efik language
Efik (native name: Efik) is a Lower Cross language of the Niger-Congo language family, belonging to the Cross River branch of the Benue-Congo sub-family. It is spoken primarily in Cross River State and Akwa Ibom State in southeastern Nigeria, with significant communities in the cities of Calabar, Uyo, and surrounding areas. Efik has between 400,000 and 700,000 native speakers, with considerably more second-language speakers across the region.
Efik holds considerable historical and cultural significance as the language of the Efik people, who established powerful city-states at Calabar that dominated the trade routes of the Cross River delta from the 17th century onward. Calabar, the capital of Cross River State, remains the main centre of Efik language and culture.
Efik served as a major lingua franca throughout much of southeastern Nigeria and was one of the first Nigerian languages to have a substantial written literature. A complete translation of the Bible into Efik was published in 1868 by the United Presbyterian Mission, one of the earliest complete Bible translations into a Nigerian language. The language continues to be used in education, religious life, and local governance in Cross River State.
Efik is closely related to Ibibio and forms part of a dialect continuum sometimes referred to as Efik-Ibibio or Efik-Ibibio-Anaang. The principal dialect of Efik, and the one regarded as standard, is the dialect of Calabar (also called Creek Town Efik or Old Calabar Efik), which served as the basis for the missionary and literary tradition.
Other closely related varieties include:
- Efut, spoken in the Efut clan area near Calabar
- Creek Town Efik, the most prestigious and widely documented variety, used in the biblical tradition
- Ibibio, a closely related language that shares significant lexical and grammatical features with Efik
- Anaang, another closely related variety spoken in Akwa Ibom State
The variety represented in this dataset is the standard Efik of Calabar, which is the form used in written materials, education, religious texts, and broadcast media.
Efik uses a Latin-based orthography. The standard orthography was largely shaped by the 19th-century missionary tradition, particularly the work of the United Presbyterian Mission, and was later revised through collaboration between linguists and mother-tongue speakers.
Vowels
Efik has seven vowel phonemes, and vowel length is phonemically contrastive in some analyses. The standard orthography uses the following vowel letters:
a, e, ɛ (written as "e" in some contexts), i, o, ɔ (written with a subscript dot or as a distinct letter in some orthographies), u
Consonants
Efik consonants include a number of sounds that require special notation:
- Labial-velar stops: kp, gb (simultaneous bilabial and velar articulations, common in Cross River languages)
- Prenasalized consonants: mb, nd, ng (common in the language)
- Palatal sounds: represented by digraphs in the standard orthography
- Nasals: m, n, ŋ (written as "ñ" or "ng" in some orthographies)
Tone System
Efik is a tonal language. The standard orthography marks tone using diacritics:
- High tone: marked with an acute accent (á, é, í, ó, ú)
- Low tone: marked with a grave accent (à, è, ì, ò, ù)
- Mid tone: sometimes left unmarked or marked with a macron (ā, ē, ī, ō, ū)
- Falling tone: marked with a circumflex accent (â, ê, î, ô, û)
Tone is phonemically contrastive and grammatically significant in Efik, playing a key role in distinguishing lexical items and grammatical constructions. The transcriptions in this dataset include tone marks where present in the source materials.
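For text preprocessing, the tone diacritics described above can be separated from their base letters with Unicode normalization. The following is a minimal sketch, not part of the dataset tooling: the function names `tone_sequence` and `strip_tones` are hypothetical, and it assumes the transcriptions encode tone with the standard Unicode acute, grave, macron, and circumflex marks (precomposed or combining). Whether a tone-stripped form is useful for a given TTS front end is a modelling decision this card does not address.

```python
import unicodedata

# Combining marks for the tone diacritics listed in this section.
TONE_MARKS = {
    "\u0301": "high",     # combining acute accent
    "\u0300": "low",      # combining grave accent
    "\u0304": "mid",      # combining macron
    "\u0302": "falling",  # combining circumflex accent
}


def tone_sequence(text: str) -> list[str]:
    """Return the tone label of every tone mark in text, in order."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text: str) -> str:
    """Remove tone marks but keep other diacritics, such as the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


# A transcription taken from the sample entries of this dataset card.
sample = "Bassey é- nyéné úfọk kíét."
print(tone_sequence(sample))
print(strip_tones(sample))
```

Decomposing to NFD first means the same code handles both precomposed letters (such as "é") and base letters followed by combining marks, and recomposing to NFC afterwards keeps letters like "ọ" as single code points.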
Syllable Structure
Efik has CV and CVC syllable structures as well as syllabic nasals. Consonant clusters are relatively uncommon within syllables.
The textual material in this dataset originates from written Efik sources covering informational, encyclopaedic, and everyday conversational content, on general topics such as everyday life, culture, and encyclopaedic knowledge. The texts were segmented into short utterances and used as prompts for audio recording sessions, in which the speaker read them aloud in the standard Calabar dialect of Efik.
The result is segmented, prompted read speech suitable for speech synthesis tasks.
The dataset is composed of 10 folders containing audio clips and corresponding mapping files.
Each folder contains between 109 and 175 audio files. Individual audio clips typically range from 1 to 30 seconds in duration.
Folder-level durations range from approximately 17 minutes to over 46 minutes of audio.
The dataset contains a total of 1,362 audio files with a combined duration of 5 hours 32 minutes and 4 seconds of segmented Efik speech.
A detailed breakdown of durations and file counts per folder is provided below.
| Folder | Files | Duration |
|---|---|---|
| tts_Efik_dataset_01_168clips_1205s_20260424-1742 | 168 | 16m 59s |
| tts_Efik_dataset_02_175clips_1317s_20260424-1933 | 175 | 19m 26s |
| tts_Efik_dataset_03_172clips_2044s_20260426-1253 | 164 | 27m 51s |
| tts_Efik_dataset_04_125clips_1962s_20260426-2328 | 125 | 26m 13s |
| tts_Efik_dataset_05_125clips_3006s_20260502-1224 | 125 | 46m 39s |
| tts_Efik_dataset_06_109clips_2429s_20260502-2122 | 109 | 35m 52s |
| tts_efik_dataset_07_125clips_2753s_20260506-2119 | 125 | 44m 29s |
| tts_efik_dataset_08_123clips_2429s_20260506-2244 | 123 | 39m 20s |
| tts_Efik_dataset_09_125clips_2412s_20260507-0832 | 125 | 37m 33s |
| tts_Efik_dataset_10_123clips_2336s_20260507-0929 | 123 | 37m 38s |
| GRAND TOTAL | 1,362 | 5h 32m 04s |
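The totals can be cross-checked against the per-folder rows. The sketch below copies the Files and Duration columns from the table and sums them; the helper name `to_seconds` is hypothetical. The file counts add up to exactly 1,362, while the per-folder durations add up to 5h 32m 00s, four seconds short of the stated total. That small gap is presumably due to each folder duration being rounded to whole seconds, which is an assumption rather than something the source states.

```python
import re

# Values copied from the Files and Duration columns of the table above.
FILE_COUNTS = [168, 175, 164, 125, 125, 109, 125, 123, 125, 123]
DURATIONS = [
    "16m 59s", "19m 26s", "27m 51s", "26m 13s", "46m 39s",
    "35m 52s", "44m 29s", "39m 20s", "37m 33s", "37m 38s",
]


def to_seconds(label: str) -> int:
    """Convert a label such as '5h 32m 04s' or '16m 59s' to seconds."""
    text = label.strip()
    match = re.fullmatch(r"(?:(\d+)h\s*)?(?:(\d+)m\s*)?(?:(\d+)s)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognised duration: {label!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


total_files = sum(FILE_COUNTS)
total_seconds = sum(to_seconds(d) for d in DURATIONS)

hours, rest = divmod(total_seconds, 3600)
minutes, seconds = divmod(rest, 60)
print(f"{total_files} files, {hours}h {minutes:02d}m {seconds:02d}s")
```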
Each folder in the dataset contains:
- A collection of audio files in MP3 format
- A tab-separated mapping file linking each audio file to its transcription
- An "attempts" subfolder containing alternative recordings where more than one take was made
Each line in the mapping file has four tab-separated fields:
audio_filename.mp3<TAB>key<TAB>transcription<TAB>attempts
The dataset is designed for TTS pipelines requiring paired audio-text data.
Sample entries (audio file | transcription):
3847bbb1e4b0aa6a97541f81b614c42a.mp3 | Ami nka udua
c8a9cf8be4f700566deaf6d589db3327.mp3 | Bassey é- nyéné úfọk kíét.
bc8b2f94147572160e47e2261c4d809c.mp3 | Nso ke Ekot fi?
d3e6da14f086b54af1df5abfc600ff48.mp3 | Okuk ifang?
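A mapping file in the four-field layout described above could be read along the following lines. This is a sketch under stated assumptions: the names `Clip` and `read_mapping` are hypothetical, the file is assumed to be UTF-8 with no header row, and the `attempts` field is kept as plain text because its exact meaning is not documented here. The demo rows reuse file names and transcriptions from the sample entries, but their key and attempts values are invented placeholders.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Clip:
    audio_filename: str
    key: str
    transcription: str
    attempts: str  # kept as text; semantics are an open assumption


def read_mapping(lines: Iterable[str]) -> Iterator[Clip]:
    """Yield one Clip per well-formed four-field, tab-separated row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed rows
        yield Clip(*(field.strip() for field in row))


# Placeholder rows: the key and attempts values are invented for illustration.
demo = io.StringIO(
    "3847bbb1e4b0aa6a97541f81b614c42a.mp3\tkey-1\tAmi nka udua\t1\n"
    "bc8b2f94147572160e47e2261c4d809c.mp3\tkey-2\tNso ke Ekot fi?\t2\n"
)
clips = list(read_mapping(demo))
for clip in clips:
    print(clip.audio_filename, "->", clip.transcription)
```

For a real folder, the same function can be fed an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points at that folder's mapping file. If the file turns out to have a header row, it would need to be skipped before building the clip list.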