Task: MT
Release Date: 6/19/2026
Format: WAV, TXT
Size: 921.47 MB
The Burushaski Speech–English Parallel Corpus is a community-driven language resource developed to support research in speech translation, machine translation, automatic speech recognition, and other language technologies for Burushaski, an under-resourced language isolate spoken in northern Pakistan. The dataset was collected using an audio-first, linguistically informed framework that combines structured elicitation targeting high-frequency vocabulary and key grammatical phenomena with the collection of functional and conversational language relevant to real-world communication. Data collection was facilitated through a custom mobile application that standardized prompts while enabling scalable community participation, and the development process incorporated continuous feedback from Burushaski-speaking contributors to improve linguistic coverage and cultural relevance. The current pilot release contains approximately 15 hours of curated speech data comprising 14,970 recorded utterances from native speakers, with each audio recording paired with an English translation, creating a parallel corpus intended to advance research, promote language inclusion in AI, and expand the digital presence of Burushaski.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
The dataset is intended solely for research and scientific purposes.
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset. Any attempt to clone the voice or train models that imitate the speakers in this dataset is forbidden.
Intended Use
The dataset is intended for training and evaluating machine translation systems.
The Burushaski Speech – English Text Parallel Dataset is a bilingual speech–text corpus designed to support research in speech translation, automatic speech recognition (ASR), machine translation (MT), and related low-resource language technologies.
The dataset contains Burushaski speech recordings paired with corresponding English text translations. It is intended for advancing research in cross-lingual speech processing for under-resourced languages.
| Split | Number of Utterances |
|---|---|
| Train | 11,976 |
| Test | 2,994 |
| Total | 14,970 |
Train–Test Split Ratio: 80–20
The dataset is divided into training and testing splits. Each split contains audio files and their corresponding English text translations.
dataset/
├── train/
│   ├── AUDIO/
│   └── TXT/
└── test/
    ├── AUDIO/
    └── TXT/
AUDIO/: Contains Burushaski speech recordings. Most filenames end with a participant ID as a suffix, but some recordings have no participant identifier.
TXT/: Contains English text translations aligned with the corresponding audio recordings.
Each audio file in AUDIO/ corresponds to a text file in TXT/ with the same base filename where available.
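The pairing rule above can be sketched in Python. This is a minimal illustration, not part of the dataset's tooling: the function name `pair_split` is hypothetical, and the lowercase `.wav` and `.txt` extensions are an assumption based on the listed formats (WAV, TXT). Recordings without a matching text file are returned separately rather than dropped silently, since the pairing holds only "where available".

```python
from pathlib import Path


def pair_split(split_dir):
    """Return (pairs, unmatched_audio) for one split directory.

    pairs is a list of (audio_path, text_path) tuples matched on the
    filename stem; unmatched_audio lists recordings with no text file.
    Assumes lowercase .wav and .txt extensions (not confirmed by the card).
    """
    split_dir = Path(split_dir)
    # Index the text files by stem (filename without its extension).
    texts = {p.stem: p for p in (split_dir / "TXT").glob("*.txt")}
    pairs, unmatched = [], []
    for audio in sorted((split_dir / "AUDIO").glob("*.wav")):
        text = texts.get(audio.stem)
        if text is None:
            unmatched.append(audio)
        else:
            pairs.append((audio, text))
    return pairs, unmatched
```

Calling `pair_split("dataset/train")` would then yield the aligned training pairs, and a non-empty second return value flags recordings that need manual inspection.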
A CSV file is provided containing participant-level metadata. In most cases, participant IDs are embedded in the filenames of audio recordings and can be used to link recordings to metadata entries.
However, not all audio files include participant identifiers, meaning some recordings cannot be associated with metadata.
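Linking recordings to the participant metadata might look like the sketch below. The card does not specify the filename pattern or the CSV schema, so two things here are purely assumptions: that the participant ID is the last underscore-separated token of the filename stem, and that the CSV has a column named `participant_id`. Both should be adjusted to match the actual files. A suffix is accepted only if it appears in the metadata, so recordings without an identifier map to `None` instead of a wrong entry.

```python
import csv
from pathlib import Path


def load_metadata(csv_file, id_column="participant_id"):
    """Read the metadata CSV into a dict keyed by participant ID.

    The column name "participant_id" is a hypothetical default.
    """
    return {row[id_column]: row for row in csv.DictReader(csv_file)}


def participant_id(filename, known_ids):
    """Return the participant ID suffix of a filename, or None.

    Assumes (hypothetically) that the ID is the last underscore-separated
    token of the stem, e.g. "utt0001_P07.wav" -> "P07".
    """
    stem = Path(filename).stem
    if "_" not in stem:
        return None
    candidate = stem.rsplit("_", 1)[1]
    # Only accept suffixes that actually appear in the metadata.
    return candidate if candidate in known_ids else None


def link(filenames, metadata):
    """Map each filename to its metadata row, or None if unlinkable."""
    return {
        name: metadata.get(participant_id(name, metadata))
        for name in filenames
    }
```

Counting the `None` values in the result of `link` gives a quick measure of how many recordings cannot be associated with metadata.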
The dataset was developed following the methodology described in:
Developing Burushaski–English Translation Dataset. CHiPSAL Workshop, LREC 2026.
Paper: http://lrec-conf.org/proceedings/lrec2026/workshops/chipsal/2026.chipsal-1.0.pdf
Researchers are strongly encouraged to refer to the publication for detailed information on:
Data collection protocol
Translation and annotation methodology
Recording setup
Quality control procedures
Dataset limitations and ethical considerations
Coverage is limited due to the low-resource nature of Burushaski.
Some dialectal and sociolinguistic variations may not be fully represented.
Not all audio files can be linked to participant metadata due to missing identifiers.
Potential demographic imbalance should be considered during model training and evaluation.
If you use this dataset, please cite the associated publication.
For questions, clarifications, or collaboration inquiries, contact the data maintainer at saleem.tauqeer@gmail.com.