Awal Tamazight Dataset

Description

This dataset is a compilation of Tamazight (zgh) language resources created by CIEMEN as part of the Awal project (https://awaldigital.org/), with funding from the Municipality of Barcelona and the Government of Catalonia. It includes 1,002 monolingual sentences drawn from Tamazight language-learning material and over 417,000 parallel sentence pairs spanning multiple language pairs: English–Tamazight, French–Tamazight, Catalan–Tamazight, Spanish–Tamazight, and Arabic–Tamazight. The parallel data comes from community contributions to the Awal platform, Tatoeba sentence pairs transliterated into Tifinagh script, Tamazight proverbs, web localization strings (Mozilla Common Voice, Awal platform), and segmented document translations. In total, the dataset contains approximately 4.6 million words across all files.

Specifics

Licensing

Creative Commons Attribution 4.0 International (CC-BY-4.0)

https://spdx.org/licenses/CC-BY-4.0.html

Considerations

Restrictions/Special Constraints

Attribution is required (CC-BY-4.0). The Common Voice localization strings (pontoon-CV-zgh-en.tsv) are additionally governed by the Mozilla Public License 2.0.

This dataset compiles several distinct Tamazight (zgh) language resources created by CIEMEN as part of the Awal project, grouped here by type.

**Monolingual**

tc_wajdm_v1.txt: sentences extracted from the tc wajdm Tamazight language learning material produced by CIEMEN.

**Parallel**

Awal contributions: community-contributed translations between Tamazight and English, French, Catalan, Spanish, and Arabic, collected through the Awal crowdsourcing platform. The dump reflects the platform state as of 2026-04-15.
Tatoeba (transliterated): sentence pairs from Tatoeba.org covering Arabic, Catalan, English, French, Moroccan Arabic, and Spanish paired with Tamazight, transliterated into Tifinagh script using an open-source Python script. Retrieved February 2024.
Proverbs: Tamazight proverbs with Catalan translations, collected by CIEMEN.
Localizations: parallel segments from localization of the Mozilla Common Voice platform (via Pontoon, MPL 2.0) and the Awal web platform, in English and Tamazight.
Document translations: segmented translations of the Awal TICAM'26 paper (English–Tamazight) and the document "Tamazight en la escuela pública marroquí: una carrera de fondo" (Spanish–Tamazight).
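
As an illustration of working with the tab-separated parallel files (for example pontoon-CV-zgh-en.tsv), the following is a minimal Python sketch. It assumes each line holds two tab-separated columns with no header row; the actual column layout and order are not documented here, so check README.md before relying on it. The helper names `read_pairs` and `has_tifinagh` are hypothetical, and the sample sentence is illustrative rather than taken from the dataset. The script check relies on the Unicode Tifinagh block (U+2D30 to U+2D7F).

```python
import csv
import io


def has_tifinagh(text):
    """Return True if the text contains at least one Tifinagh character."""
    # Unicode Tifinagh block: U+2D30..U+2D7F
    return any("\u2d30" <= ch <= "\u2d7f" for ch in text)


def read_pairs(handle):
    """Yield (first_column, second_column) tuples from a two-column TSV stream.

    Assumption: one sentence pair per line, separated by a single tab,
    without quoting and without a header row.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0].strip(), row[1].strip()


if __name__ == "__main__":
    # In-memory sample standing in for a real file; with the dataset one
    # would instead open the file with encoding="utf-8" and newline="".
    sample = io.StringIO("ⴰⵣⵓⵍ\tHello\n")
    for first, second in read_pairs(sample):
        print(first, "|", second, "| Tifinagh:", has_tifinagh(first))
```

Checking for Tifinagh characters is one simple way to confirm which column holds the Tamazight side when the column order is uncertain.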

The Awal project also contributed Tamazight corrections to two multilingual benchmark datasets: OLDI Seed and FLORES+. The correction methodology is described in the WMT 2025 paper listed below.

This dataset is also hosted on Hugging Face at collectivat/amazic. See README.md for more information.

If you use this data, please cite:

Awal – Community-Powered Language Technology for Tamazight. Alp Öktem, Farida Boudichat. TICAM'25, Rabat, Morocco, December 2025.

For the FLORES+ and OLDI Seed datasets and the machine translation experiments:

Correcting the Tamazight Portions of FLORES+ and OLDI Seed Datasets. Alp Öktem, Mohamed Aymane Farhi, Brahim Essaidi, Naceur Jabouja, Farida Boudichat. Proceedings of the Tenth Conference on Machine Translation (WMT 2025) – OLDI shared task, Suzhou, China, November 2025.
