Punjabi 10 Hours TTS

Description

The Punjabi TTS Dataset (Shahmukhi) is a high-quality speech corpus containing approximately 10 hours of read speech in Punjabi written in the Shahmukhi script. It is designed to support text-to-speech development, speech synthesis research, pronunciation modeling, and broader language technology work for Punjabi in its Perso-Arabic writing tradition. The dataset consists of paired audio recordings and text transcripts, carefully prepared to reflect clear pronunciation, natural reading style, and accurate Shahmukhi orthography. It is suitable for building and evaluating Punjabi TTS systems, as well as related applications such as speech processing, phonetic analysis, and low-resource language technology development.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

Non-commercial use only, with attribution. Derivative works require special permission and must be released under the same license.

Language

Punjabi is a major Indo-Aryan language spoken by millions of people across Pakistan and India. In Pakistan, it is commonly written in the Shahmukhi script, a Perso-Arabic writing system, and serves as an important language of daily communication, culture, poetry, and oral tradition.

Script

Shahmukhi (Perso-Arabic script): ا، ب، پ، ت، ٹ، ث، ج، چ، ح، خ، د، ڈ، ذ، ر، ڑ، ز، ژ، س، ش، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، و، ہ، ھ، ء، ی، ے۔
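
When preparing transcripts for a TTS front end, it can help to see which characters fall outside the letter list above. The following is a minimal sketch, not part of the dataset tooling: the name unlisted_chars is hypothetical, and the letter string is simply copied from the list above. Real transcripts may also contain punctuation, digits, or diacritics, so flagged characters are candidates for review rather than errors.

```python
# Letters copied from the Script list above (separators removed).
SHAHMUKHI_LETTERS = "ابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمنںوہھءیے"


def unlisted_chars(text):
    """Return the non-whitespace characters in text that are not in the list."""
    allowed = set(SHAHMUKHI_LETTERS)
    return {ch for ch in text if not ch.isspace() and ch not in allowed}
```

Calling unlisted_chars on each transcript and collecting the results gives a quick view of the symbols a text normalizer would need to handle.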

Speaker Details

Speaker-1 (ID: pnb1): Male, 36 years old
Speaker-2 (ID: pnb2): Male, 29 years old

Data Structure

The dataset is provided as a .gz archive.
The archive contains two speaker folders: Speaker 1 and Speaker 2.
Each folder includes approximately 5 hours of recordings.
Together, the two speakers make up around 10 hours of Punjabi TTS data.
Each speaker folder contains the audio recordings and their corresponding text/transcript files.
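
The layout above suggests a simple way to load one speaker's data by matching each recording to its transcript. The sketch below makes several assumptions that this description does not confirm: audio files end in ".wav", transcripts end in ".txt", each transcript shares its file stem with its recording, and the archive has already been extracted. The function name pair_audio_and_text is hypothetical; adjust the suffixes to match the actual files.

```python
from pathlib import Path

# Assumed file suffixes; the dataset description does not specify them.
AUDIO_SUFFIX = ".wav"
TEXT_SUFFIX = ".txt"


def pair_audio_and_text(speaker_dir):
    """Return (audio_path, transcript_text) pairs found under speaker_dir."""
    speaker_dir = Path(speaker_dir)
    # Index transcripts by file stem so each recording can be matched by name.
    transcripts = {p.stem: p for p in speaker_dir.rglob("*" + TEXT_SUFFIX)}
    pairs = []
    for audio in sorted(speaker_dir.rglob("*" + AUDIO_SUFFIX)):
        text_path = transcripts.get(audio.stem)
        if text_path is None:
            continue  # skip recordings that have no matching transcript
        text = text_path.read_text(encoding="utf-8").strip()
        pairs.append((audio, text))
    return pairs
```

Running the function once per speaker folder and concatenating the results would yield the full set of audio and text pairs; recordings without a matching transcript are skipped silently, which may be worth logging in practice.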