License: CC-BY-4.0
Steward: CLEAR Global
Task: MT
Release Date: 5/5/2026
Format: TSV
Size: 494.43 KB
The Lingala portion of CLEAR Global's Gamayun Language Data Kits — 5,000 parallel French–Lingala sentences (the `kit5k` mini-kit). French source sentences were drawn from the Tatoeba repository using a selection algorithm that ensures representation of the most frequently used words in French; Lingala translations were produced by professionals and volunteers of the Translators without Borders (now CLEAR Global) translator community. The package includes a single TSV with the parallel sentence pairs and a second TSV containing the original French source sentences with their Tatoeba IDs. Gamayun is part of CLEAR Global's initiative to develop open-source language technology for under-resourced languages used in humanitarian contexts.
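As a rough illustration of how the parallel TSV might be read, the sketch below parses tab-separated sentence pairs using only the Python standard library. The column order (French first, Lingala second), the absence of a header row, and the function name `read_pairs` are assumptions made for this example, not facts taken from the package; check README.md for the actual layout. The inline sample uses placeholder strings instead of real sentences.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_pairs(handle: TextIO, has_header: bool = False) -> Iterator[Tuple[str, str]]:
    """Yield (french, lingala) pairs from a tab-separated text stream.

    Assumes column 0 is French and column 1 is Lingala (hypothetical layout).
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # discard the header row if the file has one
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        french, lingala = row[0].strip(), row[1].strip()
        if french and lingala:
            yield french, lingala


# Placeholder content standing in for the real file.
sample = io.StringIO(
    "source sentence 1\ttarget sentence 1\n"
    "source sentence 2\ttarget sentence 2\n"
)
pairs = list(read_pairs(sample))
```

To read the real file, pass an open handle instead of the sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points to the TSV in the downloaded package.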
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Attribution to CLEAR Global is required.
Forbidden Usage
Using the data without giving appropriate credit to CLEAR Global.
Ethical Review
Source sentences are general-domain content from the openly licensed Tatoeba repository. Translations were produced by professional and volunteer translators from the Translators without Borders / CLEAR Global community, who were credited and (where applicable) compensated for their work.
Intended Use
Development of open-source language technology for Lingala, in particular machine translation between Lingala and French, as well as related NLP tasks (language modeling, alignment, lexicon induction).
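For machine translation experiments, the sentence pairs usually need to be divided into training, development, and test sets. The sketch below shows one reproducible way to do that with a seeded shuffle. The dataset does not prescribe any split, so the function name `split_pairs`, the split sizes, and the seed are all hypothetical choices for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair], dev_size: int, test_size: int, seed: int = 0
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Shuffle with a fixed seed and return (train, dev, test) lists."""
    if dev_size < 0 or test_size < 0 or dev_size + test_size > len(pairs):
        raise ValueError("dev_size and test_size must fit within the data")
    shuffled = list(pairs)  # copy so the caller's sequence is left untouched
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


# Placeholder pairs standing in for the 5,000 real sentence pairs.
demo = [(f"src {i}", f"tgt {i}") for i in range(10)]
train, dev, test = split_pairs(demo, dev_size=2, test_size=2)
```

Fixing the seed means the same pairs land in the same split on every run, which keeps evaluation scores comparable across experiments.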
This is the Lingala–French mini-kit (kit5k) from CLEAR Global's broader Gamayun Language Data Kits initiative, which builds parallel data for under-resourced languages by translating a curated set of general-domain French sentences sourced from Tatoeba. The algorithm used to select the source sentences is documented in the corepus-gen repository. Translations were produced by professionals and volunteers from the CLEAR Global (formerly Translators without Borders) translator community.
If you use this data, please cite Gamayun – Language Technology for Humanitarian Response:
@inproceedings{oktem2020gamayun,
title = {Gamayun -- Language Technology for Humanitarian Response},
author = {Öktem, Alp and Albayk Jaam, Muhannad and DeLuca, Eric and Tang, Grace},
booktitle = {2020 IEEE Global Humanitarian Technology Conference (GHTC)},
year = {2020},
address = {Virtual},
month = {October 29 -- November 1}
}
Please check README.md for more information.
Also published on Hugging Face.