License:
CC-BY-4.0
Steward:
CLEAR Global
Task: MT
Release Date: 4/30/2026
Format: TSV
Size: 1.68 MB
The Hausa portion of CLEAR Global's Gamayun Language Data Kits — 30,000 parallel English–Hausa sentences in total, distributed across three independent kit sizes: 5,000 (`kit5k`), 10,000 (`kit10k`), and 15,000 (`kit15k`). English source sentences were drawn from the Tatoeba repository using a selection algorithm that ensures representation of the most frequently used words in English; Hausa translations were produced by professionals and volunteers of the Translators without Borders (now CLEAR Global) translator community. Each kit TSV is accompanied by a core kit TSV containing the original English source sentences with their Tatoeba IDs. Note that kit sizes are independent selections — `kit10k` is not a superset of `kit5k`. Gamayun is part of CLEAR Global's initiative to develop open-source language technology for under-resourced languages used in humanitarian contexts.
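As a rough illustration of working with the kits, the sketch below reads tab-separated sentence pairs and checks how many English sentences two kits share, which is one way to confirm the kits are independent selections. It rests on assumptions not stated here: that English is in the first column and Hausa in the second, and that there is no header row. The helper names `read_pairs` and `overlap` are hypothetical, and the sample sentences are made-up stand-ins, not rows from the dataset. Check README.md for the actual column layout before relying on it.

```python
import csv
import io


def read_pairs(handle, source_col=0, target_col=1):
    """Yield (english, hausa) tuples from a tab-separated stream.

    The column positions are assumptions; adjust them to the real layout.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(source_col, target_col):
            continue  # skip blank or short rows
        yield row[source_col].strip(), row[target_col].strip()


def overlap(pairs_a, pairs_b):
    """Return the English sentences that appear in both lists of pairs."""
    return {en for en, _ in pairs_a} & {en for en, _ in pairs_b}


# In-memory stand-ins for two kit files (illustrative rows only).
sample_a = io.StringIO("Hello.\tSannu.\nThank you.\tNa gode.\n")
sample_b = io.StringIO("Thank you.\tNa gode.\nGood night.\tSai da safe.\n")

kit_a = list(read_pairs(sample_a))
kit_b = list(read_pairs(sample_b))
shared = overlap(kit_a, kit_b)
```

For a real kit, pass a file handle opened with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` stand-ins. `csv.QUOTE_NONE` is used so that quotation marks inside sentences are kept as literal characters rather than treated as field delimiters.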
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Attribution to CLEAR Global is required.
Forbidden Usage
Using the data without giving appropriate credit to CLEAR Global.
Ethical Review
Source sentences are general-domain content from the openly licensed Tatoeba repository. Translations were produced by professional and volunteer translators from the Translators without Borders / CLEAR Global community, who were credited and (where applicable) compensated for their work.
Intended Use
Development of open-source language technology for Hausa, in particular machine translation between Hausa and English, as well as related NLP tasks (language modeling, alignment, lexicon induction).
This is the Hausa–English kit collection from CLEAR Global's broader Gamayun Language Data Kits initiative, which provides parallel data for under-resourced languages by translating a curated set of general-domain English sentences sourced from Tatoeba. The selection algorithm used to pick source sentences is documented in the corepus-gen repository. Translations were performed by professionals and volunteers of CLEAR Global's (formerly Translators without Borders) translator community.
If you use this data, please cite Gamayun – Language Technology for Humanitarian Response:
@inproceedings{oktem2020gamayun,
title = {Gamayun -- Language Technology for Humanitarian Response},
author = {Öktem, Alp and Albayk Jaam, Muhannad and DeLuca, Eric and Tang, Grace},
booktitle = {2020 IEEE Global Humanitarian Technology Conference (GHTC)},
year = {2020},
address = {Virtual},
month = {October 29 -- November 1}
}
Please check README.md for more information.
Also published on Hugging Face.