Tamazight Open Speech Dataset

Description

This dataset provides a parsed, formatted, and ready-to-use Amazigh Voice Dataset. It contains voice recordings and their text transcripts in Standard Moroccan Amazigh (ⵜⴰⵎⴰⵣⵉⵖⵜ ⵜⴰⵏⴰⵡⴰⵢⵜ ⵜⴰⵎⵓⵔⴰⴽⵓⵛⵜ), and it is intended for training Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models.

Specifics

Licensing

Apache License 2.0 (Apache-2.0)

https://spdx.org/licenses/Apache-2.0.html

Considerations

Restrictions/Special Constraints

No restrictions apply as long as the prohibitions listed in the "Forbidden Usage" section are respected.

Forbidden Usage

This data must not be used to generate malicious voice clones or deepfakes intended for impersonation, fraud, or harassment.

Processes

Ethical Review

This dataset contains 1,801 samples with the following fields:

audio_filepath: The relative path to the audio file.
text: The string transcript of the audio in the Tifinagh script.
subset: The subset the sample belongs to. The dataset has two subsets recorded with different microphones: 'subset_1' is mono audio and 'subset_2' is stereo.

Of the 1,801 files, 1,799 have a sampling rate of 48 kHz and 2 have a sampling rate of 44.1 kHz.
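
Because the clips differ in channel count and sampling rate, it can help to check each file's format before training. The following is a minimal sketch using Python's standard wave module. It assumes the clips are PCM WAV files, which this card does not state; the function names wav_format and summarize_formats are hypothetical and not part of the dataset.

```python
import wave
from collections import Counter


def wav_format(path):
    """Return (sample_rate_hz, channel_count) read from a WAV header.

    Assumption: the file is an uncompressed PCM WAV file.
    """
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def summarize_formats(paths):
    """Count how many files share each (sample_rate_hz, channel_count) pair."""
    return Counter(wav_format(p) for p in paths)
```

Calling summarize_formats on the clip paths gives a quick way to find the files that need resampling or downmixing to match the rest of the data.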

The raw audio data is stored in the TOSD/clips subdirectory, while the metadata and transcripts are stored in the TOSD/metadata.jsonl file.
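
The layout described above can be read with the Python standard library alone. The sketch below assumes that metadata.jsonl holds one JSON object per line with the audio_filepath, text, and subset fields listed earlier, and that audio_filepath is relative to the TOSD directory; the second point is an assumption, since this card only says the path is relative. The function name load_metadata and the added audio_path key are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(root):
    """Read <root>/metadata.jsonl and return a list of sample dicts."""
    root = Path(root)
    samples = []
    with open(root / "metadata.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: audio_filepath is relative to the TOSD directory.
            record["audio_path"] = root / record["audio_filepath"]
            samples.append(record)
    return samples
```

For example, load_metadata("TOSD") would return one dict per sample, and filtering on the subset field separates the mono recordings from the stereo ones.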
