License: CC-BY-NC-4.0
Steward: LocaleNLP
Task: MT
Release Date: 3/23/2026
Format: csv
Size: 164.32 KB
This English–Hausa Parallel Corpus is a curated bilingual dataset of 5,000 aligned sentence pairs, translated from English into Hausa and organized into a clean sentence-level format to ensure reliable alignment. The dataset is designed to support machine translation training and evaluation, bilingual lexicon development, and broader linguistic and natural language processing (NLP) research for Hausa, including data-driven language technology development.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
The dataset is intended for research and non-commercial use.
Forbidden Usage
• Generating harmful or misleading content
• Commercial use without permission
• Misrepresentation of Hausa language or culture
Hausa is a widely spoken Chadic language used across West Africa, particularly in Nigeria and Niger. It serves as a major lingua franca with strong cultural and linguistic importance. Hausa is primarily written using the Latin-based Boko script.
English serves as the source language and represents modern, general-purpose usage.
The dataset consists of general-purpose sentences translated from English into Hausa. It is designed to support research, machine translation systems, and the development of Hausa language technologies.
This corpus is a bilingual English–Hausa parallel dataset containing 5,000 professionally aligned sentence pairs (English → Hausa). The dataset is structured at the sentence level and formatted for direct use in NLP pipelines, including machine translation, evaluation benchmarks, and linguistic analysis.
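As a minimal sketch of reading the sentence pairs into a pipeline, the Python snippet below parses CSV text into (English, Hausa) tuples. The column names `english` and `hausa` are assumptions, since the header of the released file is not documented here, and the two sample rows are illustrative rather than taken from the corpus.

```python
import csv
import io
from typing import Iterable, List, Tuple

# Assumed column names; check the header row of the actual CSV file.
SRC_COL = "english"
TGT_COL = "hausa"


def read_pairs(
    lines: Iterable[str], src_col: str = SRC_COL, tgt_col: str = TGT_COL
) -> List[Tuple[str, str]]:
    """Read (English, Hausa) pairs from CSV text, skipping rows with an empty side."""
    reader = csv.DictReader(lines)
    pairs = []
    for row in reader:
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Tiny in-memory sample standing in for the real file (illustrative rows only).
sample = io.StringIO(
    "english,hausa\n"
    "Good morning.,Ina kwana.\n"
    "Thank you.,Na gode.\n"
)
pairs = read_pairs(sample)
print(len(pairs), pairs[0])
```

For the real file, the same function can be given a file object opened with `open(path, encoding="utf-8", newline="")`, where `path` points at the downloaded CSV.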
• Sentence Pairs: 5,000
• English Words: 41,727
• Hausa Words: 45,921
• Total Words: 87,648
• Translation Direction: English → Hausa
• Content Type: Parallel sentences
• Script: English (Latin), Hausa (Latin)
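The published totals are internally consistent (41,727 + 45,921 = 87,648). The sketch below shows one way such figures could be recomputed from a list of sentence pairs. It counts whitespace-separated tokens, which is an assumption: the card does not say how words were counted, so the results may differ from the numbers above. The function name `corpus_stats` and the sample pairs are hypothetical.

```python
from typing import Dict, List, Tuple


def corpus_stats(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count sentence pairs and whitespace-separated words on each side."""
    english_words = sum(len(src.split()) for src, _ in pairs)
    hausa_words = sum(len(tgt.split()) for _, tgt in pairs)
    return {
        "sentence_pairs": len(pairs),
        "english_words": english_words,
        "hausa_words": hausa_words,
        "total_words": english_words + hausa_words,
    }


# Illustrative pairs, not taken from the corpus.
sample_pairs = [
    ("Good morning.", "Ina kwana."),
    ("Thank you very much.", "Na gode sosai."),
]
stats = corpus_stats(sample_pairs)
print(stats)
```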
• Normalize Unicode (NFC)
• Standardize punctuation and spacing
• Verify sentence alignment
• Remove duplicates
• Filter noisy or corrupted text
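The steps listed above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the procedure actually used to build the corpus: the quote mapping, the noise test (replacement or control characters), and the length-ratio check used as a rough stand-in for alignment verification are all illustrative choices, and the names `normalize_text`, `is_noisy`, and `clean_pairs` are hypothetical.

```python
import re
import unicodedata
from typing import List, Tuple

# Assumed punctuation convention: curly quotes become straight quotes.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_text(text: str) -> str:
    """Apply NFC, then standardize quotes and spacing."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTE_MAP)
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # no space before punctuation
    return text


def is_noisy(text: str) -> bool:
    """Flag empty strings, U+FFFD replacement characters, and control characters."""
    if not text or "\ufffd" in text:
        return True
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def clean_pairs(
    pairs: List[Tuple[str, str]], max_len_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Normalize, filter, and deduplicate sentence pairs, preserving order."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize_text(src), normalize_text(tgt)
        if is_noisy(src) or is_noisy(tgt):
            continue
        # Crude alignment heuristic (an assumption): drop pairs whose
        # word counts differ by more than max_len_ratio.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Illustrative input, not taken from the corpus.
raw = [
    ("Good  morning .", "Ina kwana."),
    ("Good morning.", "Ina kwana."),                 # duplicate after normalization
    ("Thank you.", "Na gode\ufffd."),                # corrupted character
    ("Yes.", "Na gode sosai. Ina kwana. Na gode."),  # length mismatch
]
cleaned = clean_pairs(raw)
print(cleaned)
```

A length-ratio test only catches gross misalignments; verifying that a Hausa sentence actually translates its English counterpart would still need human review or a stronger automatic check.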