License:
Apache-2.0
Steward:
Tamazight NLP
Task: ASR
Release Date: 4/29/2026
Format: WAV, JSONL
Size: 459.09 MB
This dataset provides a parsed, formatted, and ready-to-use Amazigh Voice Dataset. It contains voice recordings and corresponding text transcripts in Standard Moroccan Amazigh (ⵜⴰⵎⴰⵣⵉⵖⵜ ⵜⴰⵏⴰⵡⴰⵢⵜ ⵜⴰⵎⵓⵔⴰⴽⵓⵛⵜ) intended for training Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models.
Restrictions/Special Constraints
No restrictions apply, provided that the prohibitions listed in the "Forbidden Usage" section are respected.
Forbidden Usage
This data should not be used to generate malicious voice clones or deepfakes intended for impersonation, fraud, or harassment.
Ethical Review
The dataset contains the voice recordings of the creator. No other personally identifiable information (PII) is included in the audio or text.
Intended Use
This dataset is intended for training or fine-tuning Speech-to-Text (STT / ASR) or Text-to-Speech (TTS) models. It can also be used for linguistic research on Amazigh phonetics and speech.
This dataset contains 1,801 samples with the following fields:
audio_filepath: The relative path to the audio file.
text: The string transcript of the audio in the Tifinagh script.
subset: The subset the sample belongs to. The dataset has two subsets recorded with different microphones: 'subset_1' is mono audio, while 'subset_2' is stereo.
Of the 1,801 files, 1,799 have a sampling rate of 48 kHz and the remaining 2 have a sampling rate of 44.1 kHz.
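Because channel count and sampling rate vary across files, it can help to inspect each file's format before training, so that stereo or 44.1 kHz clips can be converted as needed. The following is a minimal sketch using only the Python standard library. The function names wav_format and summarize_formats are hypothetical, and the sketch assumes the clips are uncompressed PCM WAV files with a lowercase .wav extension, which this page does not confirm.

```python
import wave
from collections import Counter
from pathlib import Path


def wav_format(path):
    """Return (sample_rate_hz, channel_count) read from a WAV file header.

    Assumption: the file is uncompressed PCM, the only WAV flavour
    the standard-library wave module can read.
    """
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def summarize_formats(clips_dir):
    """Count how many WAV files share each (sample_rate, channels) pair.

    Assumption: clips end in a lowercase ".wav"; subdirectories are searched too.
    """
    return Counter(wav_format(p) for p in sorted(Path(clips_dir).rglob("*.wav")))
```

Calling summarize_formats on the clips directory returns a Counter keyed by (sample rate, channels), which makes the mono/stereo split and the 44.1 kHz files easy to spot.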
The raw audio data is stored in the TOSD/clips subdirectory while the metadata and transcripts are stored in the TOSD/metadata.jsonl file.
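The layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not an official loader: the function name load_metadata and the added audio_path key are hypothetical, and it assumes that metadata.jsonl holds one JSON object per line and that audio_filepath is relative to the TOSD directory.

```python
import json
from pathlib import Path


def load_metadata(dataset_root):
    """Read metadata.jsonl under dataset_root and return one dict per sample."""
    root = Path(dataset_root)
    records = []
    # Transcripts are in Tifinagh script, so read the file explicitly as UTF-8.
    with open(root / "metadata.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: audio_filepath is relative to the dataset root (TOSD).
            # audio_path is a hypothetical convenience key, not a dataset field.
            record["audio_path"] = root / record["audio_filepath"]
            records.append(record)
    return records
```

For example, records = load_metadata("TOSD") would give a list in which each entry carries the audio_filepath, text, and subset fields alongside the resolved audio_path.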