Awal Tamazight Dataset
License:
CC-BY-4.0
Steward:
Community
Task: LM
Release Date: 4/15/2026
Format: TSV, JSON, TXT
Size: 11.57 MB
Description
This dataset is a compilation of Tamazight (zgh) language resources created by CIEMEN as part of the Awal project (https://awaldigital.org/), with funding from the Municipality of Barcelona and the Government of Catalonia. It includes 1,002 monolingual sentences extracted from Tamazight language-learning material and over 417,000 parallel sentence pairs spanning five language pairs: English–Tamazight, French–Tamazight, Catalan–Tamazight, Spanish–Tamazight, and Arabic–Tamazight. The parallel data comes from community contributions to the Awal platform, Tatoeba sentence pairs transliterated into Tifinagh script, Tamazight proverbs, web localization strings (Mozilla Common Voice, Awal platform), and segmented document translations. The dataset totals approximately 4.6 million words across all files.
Specifics
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Considerations
Restrictions/Special Constraints
Attribution required (CC-BY-4.0). Common Voice localization strings (pontoon-CV-zgh-en.tsv) are additionally governed by Mozilla Public License 2.0.
Forbidden Usage
No use without attribution to CIEMEN and the Awal project. Usage of Common Voice localization strings must comply with Mozilla Public License 2.0.
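The licensing split described above can be made explicit in a data pipeline. The following is a minimal sketch, not part of the dataset: the names `DEFAULT_LICENSE`, `ADDITIONAL_LICENSES`, and `licenses_for` are hypothetical. It encodes only what this card states, namely that everything is CC-BY-4.0 and that pontoon-CV-zgh-en.tsv is additionally governed by MPL 2.0.

```python
from __future__ import annotations

from pathlib import PurePath

# Hypothetical helper: names and structure are illustrative only.
DEFAULT_LICENSE = "CC-BY-4.0"

# The only file this card names as carrying an additional license.
ADDITIONAL_LICENSES = {"pontoon-CV-zgh-en.tsv": "MPL-2.0"}


def licenses_for(filename: str) -> list[str]:
    """Return the SPDX identifiers that apply to one file of the dataset."""
    name = PurePath(filename).name  # ignore any directory prefix
    licenses = [DEFAULT_LICENSE]
    extra = ADDITIONAL_LICENSES.get(name)
    if extra is not None:
        licenses.append(extra)
    return licenses


print(licenses_for("data/pontoon-CV-zgh-en.tsv"))
print(licenses_for("tc_wajdm_v1.txt"))
```

A lookup like this lets downstream code carry the correct attribution and license notices alongside each file it redistributes.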
Processes
Ethical Review
Data consists of written text only. Community contributions were made voluntarily through the Awal platform. No personally identifiable information is included. The dataset supports the preservation and digital development of Tamazight.
Intended Use
Training and evaluation of machine translation systems for Tamazight. Language model pre-training. Linguistic research on Tamazight. Development of NLP tools and resources.
Metadata
This dataset is a compilation of several distinct Tamazight (zgh) language resources created by CIEMEN as part of the Awal project.
**Monolingual**
tc_wajdm_v1.txt: sentences extracted from the tc wajdm Tamazight language learning material produced by CIEMEN.
**Parallel**
AWAL contributions: community-contributed translations between Tamazight and English, French, Catalan, Spanish, and Arabic, collected through the Awal crowdsourcing platform. The dump reflects the platform state as of 2026-04-15.
Tatoeba (transliterated): sentence pairs from Tatoeba.org covering Arabic, Catalan, English, French, Moroccan Arabic, and Spanish paired with Tamazight, transliterated into Tifinagh script using an open-source Python script. Retrieved February 2024.
Proverbs: Tamazight proverbs with Catalan translations, collected by CIEMEN.
Localizations: parallel segments from localization of the Mozilla Common Voice platform (via Pontoon, MPL 2.0) and the Awal web platform, in English and Tamazight.
Document translations: segmented translations of the Awal TICAM'26 paper (English–Tamazight) and the document "Tamazight en la escuela pública marroquí: una carrera de fondo" (Spanish–Tamazight).
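As an illustration of working with the parallel files listed above, the sketch below reads sentence pairs from a TSV stream and keeps only those whose Tamazight side is written mostly in Tifinagh. The column layout is an assumption (Tamazight in the first column, the other language in the second, no header row); check README.md for the actual layout of each file. The helper names `tifinagh_ratio` and `read_pairs` and the 0.9 threshold are hypothetical, and the sample rows are made up for the demonstration.

```python
from __future__ import annotations

import csv
import io
from typing import Iterator, TextIO

# Unicode Tifinagh block: U+2D30 to U+2D7F.
TIFINAGH_START, TIFINAGH_END = 0x2D30, 0x2D7F


def tifinagh_ratio(text: str) -> float:
    """Fraction of alphabetic characters in `text` that are Tifinagh."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(TIFINAGH_START <= ord(ch) <= TIFINAGH_END for ch in letters)
    return hits / len(letters)


def read_pairs(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (tamazight, other) pairs from a tab-separated stream.

    Assumes two columns with Tamazight first; rows with fewer than two
    fields or with an empty side are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue
        tamazight, other = row[0].strip(), row[1].strip()
        if tamazight and other:
            yield tamazight, other


# Made-up sample rows standing in for a real file such as pontoon-CV-zgh-en.tsv.
sample = io.StringIO("ⴰⵣⵓⵍ\tHello\nazul\tHello\n\n")
pairs = [p for p in read_pairs(sample) if tifinagh_ratio(p[0]) >= 0.9]
print(pairs)
```

For a real file, replace the `io.StringIO` sample with `open(path, encoding="utf-8", newline="")`. Filtering by script is optional; it is shown here because the Tatoeba portion is described as transliterated into Tifinagh, so a low ratio on the Tamazight side may point to a row worth inspecting.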
The Awal project also contributed Tamazight corrections to two multilingual benchmark datasets: OLDI Seed and FLORES+. Details on the correction methodology are described in the WMT 2025 paper listed below.
This dataset is also hosted on Hugging Face at collectivat/amazic. Please check README.md for more information.
If you use this data, please cite:
Awal – Community-Powered Language Technology for Tamazight. Alp Öktem, Farida Boudichat. TICAM'25, Rabat, Morocco, December 2025.
For FLORES+ and OLDI datasets and Machine Translation experiments:
Correcting the Tamazight Portions of FLORES+ and OLDI Seed Datasets. Alp Oktem, Mohamed Aymane Farhi, Brahim Essaidi, Naceur Jabouja, Farida Boudichat. Proceedings of the Tenth Conference on Machine Translation (WMT 2025) – OLDI shared task, Suzhou, China, November 2025.