License: CC-BY-SA-4.0
Steward: CLEAR Global
Task: MT
Release Date: 4/23/2026
Format: TSV
Size: 405.06 KB
The Tigrinya portion of CLEAR Global's Gamayun Language Data Kits: 5,000 parallel English–Tigrinya sentences (the `kit5k` mini-kit). English source sentences were drawn from the Tatoeba repository using a selection algorithm that ensures the most frequently used English words are represented; Tigrinya translations were produced by professionals and volunteers of the Translators without Borders (now CLEAR Global) translator community. The package includes one TSV with the parallel sentence pairs and a second TSV containing the original English source sentences with their Tatoeba IDs. Gamayun is part of CLEAR Global's initiative to develop open-source language technology for under-resourced languages used in humanitarian contexts.
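As a rough illustration of how the parallel TSV might be consumed, the sketch below reads tab-separated lines into (English, Tigrinya) pairs. The column order, the absence of a header row, and the helper name `read_parallel_tsv` are assumptions made for this example, not documented properties of the package; check the actual file before relying on them. The two sample rows are invented for the demonstration and are not taken from the dataset.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_parallel_tsv(
    lines: Iterable[str],
    src_col: int = 0,          # assumed position of the English column
    tgt_col: int = 1,          # assumed position of the Tigrinya column
    has_header: bool = False,  # assumption; verify against the real file
) -> List[Tuple[str, str]]:
    """Return (source, target) sentence pairs from tab-separated lines."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)
    pairs: List[Tuple[str, str]] = []
    for row in reader:
        # Skip blank or short rows instead of raising on them.
        if len(row) <= max(src_col, tgt_col):
            continue
        src, tgt = row[src_col].strip(), row[tgt_col].strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Invented in-memory sample standing in for the real file.
sample = io.StringIO("Hello.\tሰላም።\nThank you.\tየቐንየለይ።\n\n")
pairs = read_parallel_tsv(sample)
print(len(pairs))
```

For the real file, pass a handle opened with `encoding="utf-8"` and `newline=""` in place of the in-memory sample, and adjust `src_col`, `tgt_col`, and `has_header` to match the layout you find.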
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
Attribution to CLEAR Global is required. Derivative works must be distributed under the same CC-BY-SA-4.0 license (ShareAlike).
Forbidden Usage
Using the data without giving appropriate credit to CLEAR Global. Redistributing the data or any derivative works under any other license.
Ethical Review
Source sentences are general-domain content from the openly licensed Tatoeba repository. Translations were produced by professional and volunteer translators from the Translators without Borders / CLEAR Global community, who were credited and (where applicable) compensated for their work.
Intended Use
Development of open-source language technology for Tigrinya, in particular machine translation between Tigrinya and English, as well as related NLP tasks (language modeling, alignment, lexicon induction).
This is the Tigrinya–English mini-kit (kit5k) from CLEAR Global's broader Gamayun Language Data Kits initiative, which provides parallel data for under-resourced languages by translating a curated set of general-domain English sentences sourced from Tatoeba. The selection algorithm used to pick source sentences is documented in the corepus-gen repository. Translations were performed by professionals and volunteers of CLEAR Global's Translators without Borders community.
If you use this data, please cite Gamayun – Language Technology for Humanitarian Response:
@inproceedings{oktem2020gamayun,
title = {Gamayun -- Language Technology for Humanitarian Response},
author = {Öktem, Alp and Albayk Jaam, Muhannad and DeLuca, Eric and Tang, Grace},
booktitle = {2020 IEEE Global Humanitarian Technology Conference (GHTC)},
year = {2020},
address = {Virtual},
month = {October 29 -- November 1}
}
Please check README.md for more information.
Also published on Hugging Face