Punjabi 10 Hours TTS
License:
CC-BY-NC-SA-4.0
Steward:
MirasAI
Task: TTS
Release Date: 4/14/2026
Format: WEBM, TSV
Size: 481.96 MB
Description
The Punjabi TTS Dataset (Shahmukhi) is a high-quality speech corpus containing approximately 10 hours of read speech in Punjabi written in the Shahmukhi script. It is designed to support text-to-speech development, speech synthesis research, pronunciation modeling, and broader language technology work for Punjabi in its Perso-Arabic writing tradition. The dataset consists of paired audio recordings and text transcripts, carefully prepared to reflect clear pronunciation, natural reading style, and accurate Shahmukhi orthography. It is suitable for building and evaluating Punjabi TTS systems, as well as related applications such as speech processing, phonetic analysis, and low-resource language technology development.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
Non-commercial use only, with attribution; derivative works require special permission and must be released under the same license.
Forbidden Usage
Do not use this dataset for speaker identification, harmful content, or any commercial purpose.
Metadata
Language
Punjabi is a major Indo-Aryan language spoken by millions of people across Pakistan and India. In Pakistan, it is commonly written in the Shahmukhi script, a Perso-Arabic writing system, and serves as an important language of daily communication, culture, poetry, and oral tradition.
Script
Shahmukhi (Perso-Arabic script): ا، ب، پ، ت، ٹ، ث، ج، چ، ح، خ، د، ڈ، ذ، ر، ڑ، ز، ژ، س، ش، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، و، ہ، ھ، ء، ی، ے۔
Speaker Details
Speaker-1 (ID: pnb1): Male, 36 years old
Speaker-2 (ID: pnb2): Male, 29 years old
Data Structure
The dataset is provided as a .gz archive.
The archive contains two speaker folders: Speaker 1 and Speaker 2.
Each folder includes approximately 5 hours of recordings.
Together, the two speakers make up around 10 hours of Punjabi TTS data.
Each speaker folder contains the audio recordings and their corresponding text/transcript files.
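The layout described above can be explored with a short script before committing to a training pipeline. The following is a minimal sketch, not an official loader. It assumes the archive has already been extracted to a local directory, that audio files carry a .webm extension, and that transcripts are tab-separated .tsv files located somewhere under each speaker folder. The function names are hypothetical, and because the card does not document the TSV columns, rows are returned uninterpreted.

```python
import csv
from pathlib import Path


def inventory_speaker_folder(folder):
    """Collect audio paths and raw transcript rows from one speaker folder.

    Assumption (not confirmed by the dataset card): audio files end in
    .webm and transcripts are tab-separated .tsv files under the folder.
    Column meanings are unknown, so each row is kept as a plain list.
    """
    folder = Path(folder)
    audio_files = sorted(folder.rglob("*.webm"))
    transcript_rows = []
    for tsv_path in sorted(folder.rglob("*.tsv")):
        with open(tsv_path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if row:  # skip blank lines
                    transcript_rows.append(row)
    return audio_files, transcript_rows


def inventory_dataset(root):
    """Map each top-level folder name under root to (audio_files, transcript_rows)."""
    root = Path(root)
    return {
        child.name: inventory_speaker_folder(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }
```

Calling `inventory_dataset` on the extraction directory and comparing the number of audio files with the number of transcript rows per speaker is a quick way to spot missing pairs. Inspect the first few rows to learn the actual column order before relying on it.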