License:
CC-BY-4.0
Steward:
CLEAR Global
Task: LM
Release Date: 4/16/2026
Format: TXT
Size: 545.68 KB
A text corpus of 10,281 randomized sentences (90,706 words) extracted from books by Kanuri authors Dr. Baba Kura Alkali Gazali, Lawan Dalama, Kaka Gana Abba, and Lawan Hassan. The corpus includes both original and normalized (lowercased, punctuation-removed) versions. It was compiled by CLEAR Global (formerly Translators without Borders) for the creation of open-source language technology. These sentences were also recorded by multiple speakers to make a speech corpus published within TWB Voice.
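As an illustration of what "lowercased, punctuation-removed" could mean in practice, here is a minimal Python sketch. It is an assumption, not the procedure CLEAR Global actually used: the function name `normalize_sentence`, the choice to treat every Unicode punctuation character as removable, and the whitespace collapsing are all hypothetical. Check README.md for the real normalization rules before relying on it.

```python
import unicodedata


def normalize_sentence(text: str) -> str:
    """Lowercase a sentence and strip punctuation (hypothetical rules)."""
    lowered = text.lower()
    # Assumption: drop every character whose Unicode category starts with "P"
    # (punctuation), which also removes apostrophes and hyphens.
    kept = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any whitespace runs left behind by removed characters.
    return " ".join(kept.split())
```

Because Python's `str.lower()` is Unicode-aware, letters outside ASCII (for example `Ə`, which lowercases to `ə`) are handled without extra code. Whether apostrophes and hyphens should really be removed depends on the orthography conventions the corpus follows, so treat that choice as a placeholder.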
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Attribution to CLEAR Global and the authors is required.
Forbidden Usage
Creating harmful, threatening, defamatory, or deceptive content. Victimizing or intimidating individuals or groups. Harming minors. Any use contrary to CLEAR Global's humanitarian mission. Violating applicable law.
Ethical Review
The authors gave written consent for their material to be published and were compensated for their contribution. Sentence order was randomized so that the original literary works cannot be reconstructed from the corpus.
Intended Use
Development of open-source language technology for Kanuri, including language modeling, text normalization, and NLP research.
This corpus was compiled from books by four Kanuri authors as part of the Gamayun initiative by CLEAR Global (formerly Translators without Borders). Sentences were randomized and provided in both original and text-normalized forms. Please check README.md for more information.
Some of these sentences were also recorded by multiple speakers to make a speech corpus published within TWB Voice.
The corpus is also published on the CLEAR Global Hugging Face repository.