Burushaski-English Speech Translation Corpus

Description

The Burushaski Speech–English Parallel Corpus is a community-driven language resource developed to support research in speech translation, machine translation, automatic speech recognition, and other language technologies for Burushaski, an under-resourced language isolate spoken in northern Pakistan. The dataset was collected using an audio-first, linguistically informed framework that combines structured elicitation targeting high-frequency vocabulary and key grammatical phenomena with the collection of functional and conversational language relevant to real-world communication. Data collection was facilitated through a custom mobile application that standardized prompts while enabling scalable community participation, and the development process incorporated continuous feedback from Burushaski-speaking contributors to improve linguistic coverage and cultural relevance. The current pilot release contains approximately 15 hours of curated speech data comprising 14,970 recorded utterances from native speakers, with each audio recording paired with an English translation, creating a parallel corpus intended to advance research, promote language inclusion in AI, and expand the digital presence of Burushaski.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Burushaski Speech – English Text Parallel Dataset

Overview

The Burushaski Speech – English Text Parallel Dataset is a bilingual speech–text corpus designed to support research in speech translation, automatic speech recognition (ASR), machine translation (MT), and related low-resource language technologies.

The dataset contains Burushaski speech recordings paired with corresponding English text translations. It is intended for advancing research in cross-lingual speech processing for under-resourced languages.

Dataset Statistics

Split	Number of Utterances
Train	11,976
Test	2,994
Total	14,970

Train–Test Split Ratio: 80–20
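
The stated ratio can be checked directly from the utterance counts in the table above. The following is a minimal sketch; the variable names are illustrative and not part of the dataset.

```python
# Utterance counts copied from the Dataset Statistics table.
splits = {"Train": 11_976, "Test": 2_994}

total = sum(splits.values())
# Fraction of all utterances that falls in each split.
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# prints:
# Train: 80.0%
# Test: 20.0%
```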

Dataset Organization

The dataset is divided into training and testing splits. Each split contains audio files and their corresponding English text translations.

Folder Structure

dataset/
├── train/
│   ├── AUDIO/
│   └── TXT/
└── test/
    ├── AUDIO/
    └── TXT/

Directory Description

AUDIO/: Contains Burushaski speech recordings. Most filenames end with a participant ID as a suffix, but not all recordings have an associated participant identifier.
TXT/: Contains English text translations aligned with the corresponding audio recordings.

Where a translation is available, each audio file in AUDIO/ has a matching text file in TXT/ with the same base filename.
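
The base-filename convention can be used to pair recordings with their translations. The sketch below is one possible way to do this, not an official loader: the function name `pair_files` is hypothetical, and the code makes no assumption about file extensions beyond matching on the name without its extension.

```python
from pathlib import Path


def pair_files(split_dir):
    """Pair each file in AUDIO/ with the TXT/ file sharing its base filename.

    Returns (pairs, unmatched): pairs is a list of (audio_path, text_path)
    tuples; unmatched lists audio files that have no text counterpart.
    """
    split_dir = Path(split_dir)
    # Index text files by base filename (the name without its extension).
    texts = {p.stem: p for p in (split_dir / "TXT").iterdir() if p.is_file()}

    pairs, unmatched = [], []
    for audio in sorted((split_dir / "AUDIO").iterdir()):
        if not audio.is_file():
            continue
        text = texts.get(audio.stem)
        if text is None:
            unmatched.append(audio)
        else:
            pairs.append((audio, text))
    return pairs, unmatched
```

Calling `pair_files("dataset/train")` would return the matched pairs for the training split along with any audio files that lack a translation, which is worth inspecting before training.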

Metadata

A CSV file is provided containing participant-level metadata. In most cases, participant IDs are embedded in the filenames of audio recordings and can be used to link recordings to metadata entries.

However, not all audio files include participant identifiers, meaning some recordings cannot be associated with metadata.
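
Linking a recording to its metadata row might look like the sketch below. Two details are assumptions, since neither the exact filename pattern nor the CSV header is specified here: the participant ID is taken to be the part of the base filename after the last underscore, and the ID column is taken to be named `participant_id`. Both are exposed as parameters so they can be adjusted to the actual files. Recordings without a recognizable ID return `None`.

```python
import csv
from pathlib import Path


def load_metadata(csv_path, id_column="participant_id"):
    """Read participant metadata into a dict keyed by participant ID.

    ASSUMPTION: the ID column is named "participant_id"; change id_column
    to match the real CSV header.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[id_column]: row for row in csv.DictReader(handle)}


def participant_for(audio_path, metadata, separator="_"):
    """Return the metadata row for an audio file, or None if it cannot be linked.

    ASSUMPTION: the participant ID is the part of the base filename that
    follows the last separator.
    """
    stem = Path(audio_path).stem
    if separator not in stem:
        return None
    candidate = stem.rsplit(separator, 1)[1]
    # Only accept the suffix if it matches a known participant ID.
    return metadata.get(candidate)
```

Checking the suffix against the known IDs, rather than trusting any suffix, keeps recordings without identifiers from being linked to the wrong participant.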

Data Collection Methodology

The dataset was developed following the methodology described in:

Developing Burushaski–English Translation Dataset. CHiPSAL Workshop, LREC 2026.

Paper: http://lrec-conf.org/proceedings/lrec2026/workshops/chipsal/2026.chipsal-1.0.pdf

Researchers are strongly encouraged to refer to the publication for detailed information on:

Data collection protocol
Translation and annotation methodology
Recording setup
Quality control procedures
Dataset limitations and ethical considerations

Limitations

Coverage is limited due to the low-resource nature of Burushaski.
Some dialectal and sociolinguistic variations may not be fully represented.
Not all audio files can be linked to participant metadata due to missing identifiers.
Potential demographic imbalance should be considered during model training and evaluation.

Citation

If you use this dataset, please cite the associated publication.

Contact

For questions, clarifications, or collaboration inquiries, contact the data maintainer at saleem.tauqeer@gmail.com.
