Kannada Time-Aligned Speech Corpus

Description

The Kannada Time-Aligned Speech Corpus is a 5-hour speech dataset containing Kannada audio with corresponding time-aligned transcriptions. It is designed to support speech technology and research tasks such as automatic speech recognition, forced alignment, speech segmentation, pronunciation modeling, and spoken language analysis. The dataset provides a useful resource for developing and evaluating Kannada language technologies.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

Use is permitted with attribution for non-commercial purposes only, and any shared adaptations must be distributed under the same license terms.

Forbidden Usage

Forbidden uses include commercial use, redistribution without proper attribution, and sharing modified versions under a different license.

Language

Kannada is a major Dravidian language spoken primarily in the Indian state of Karnataka, where it is the official language, and by Kannada-speaking communities elsewhere in India and abroad. It has a long literary history, a rich written tradition, and its own script. Kannada is widely used in education, media, administration, literature, and everyday communication, and it is one of the scheduled languages recognized in the Constitution of India.

Data Structure

The dataset is organized into two main folders:

Audio/ — contains the Kannada speech recordings
Transcription/ — contains the corresponding text transcriptions for each audio file

Each transcription file corresponds to an audio file, making the dataset easy to use for speech processing, alignment, and transcription-based tasks.
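
As an illustration of the one-to-one layout described above, the following minimal sketch pairs files from the two folders. It assumes that an audio file and its transcription share the same filename stem (for example, a hypothetical rec_001.wav and rec_001.srt); the naming scheme and file extensions are not specified in this description, so adjust the matching rule to the actual files.

```python
from pathlib import Path


def pair_files(root):
    """Pair files in Audio/ and Transcription/ that share a filename stem.

    Assumption: matching is by stem only; extensions are ignored because
    the dataset description does not state them.
    """
    root = Path(root)
    audio = {p.stem: p for p in (root / "Audio").iterdir() if p.is_file()}
    text = {p.stem: p for p in (root / "Transcription").iterdir() if p.is_file()}

    # Stems present in both folders form usable (audio, transcription) pairs.
    pairs = {stem: (audio[stem], text[stem]) for stem in sorted(audio.keys() & text.keys())}

    # Stems present in only one folder are reported so they can be inspected.
    missing_text = sorted(audio.keys() - text.keys())
    missing_audio = sorted(text.keys() - audio.keys())
    return pairs, missing_text, missing_audio
```

Calling pair_files on the dataset root returns the matched pairs plus two lists of unmatched stems, which gives a quick check that no recording lacks a transcription.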

Speaker Information

The dataset includes recordings from two native Kannada speakers:

Speaker 1: Male, 32 years old
Speaker 2: Female, 39 years old

With one male and one female speaker, the corpus offers basic but limited diversity in gender and age; systems trained on it alone may not generalize well to other voices.

Recommended Processing

Verify audio quality
Normalize transcription text
Match audio and transcription filenames
Check alignment consistency
Remove noisy or corrupted files
Standardize formats and metadata
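
Two of the steps above, text normalization and alignment checking, can be sketched as follows. This is only one possible approach: Unicode NFC normalization with whitespace collapsing is an assumed normalization policy, and the (start_ms, end_ms, text) tuple layout and the helper names are hypothetical, not part of the dataset.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def check_alignment(segments, audio_duration_ms=None):
    """Return a list of problems found in (start_ms, end_ms, text) segments."""
    problems = []
    prev_end = 0
    for i, (start, end, text) in enumerate(segments, start=1):
        if end <= start:
            problems.append(f"segment {i}: end is not after start")
        if start < prev_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        if not text.strip():
            problems.append(f"segment {i}: empty transcription")
        # Only checked when the caller supplies the audio length.
        if audio_duration_ms is not None and end > audio_duration_ms:
            problems.append(f"segment {i}: extends past the end of the audio")
        prev_end = max(prev_end, end)
    return problems
```

An empty list from check_alignment means none of these particular checks failed; it does not confirm that the timestamps match the spoken audio.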

Sample

1
00:00:00,001 --> 00:00:02,956
ನಾನು ಇಂದು ಶಿಕ್ಷಣದ ಬಗ್ಗೆ ಮಾತನಾಡಲ್ಲ ಶಿಕ್ಷಣದ

2
00:00:02,980 --> 00:00:04,783
ಮಹತ್ವದ ಬಗ್ಗೆ

3
00:00:04,807 --> 00:00:06,031
ಮಾತನಾಡಲು ಹೊರಟಿದ್ದೇನೆ.
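
The sample above resembles the SubRip (SRT) subtitle layout: a segment index, a line with start and end times in HH:MM:SS,mmm form, and the transcribed text. Assuming every transcription file follows this layout, a minimal parser might look like the sketch below; the function names are illustrative and not part of the dataset.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an HH:MM:SS,mmm timestamp to milliseconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    h, m, s, ms = (int(g) for g in match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_segments(content):
    """Parse blank-line-separated blocks into (index, start_ms, end_ms, text)."""
    segments = []
    # Drop a leading byte-order mark, if any, before splitting into blocks.
    for block in re.split(r"\n\s*\n", content.lstrip("\ufeff").strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3:
            continue  # skip blocks without an index, a time line, and text
        start, end = lines[1].split("-->")
        segments.append((int(lines[0]), to_ms(start), to_ms(end), " ".join(lines[2:])))
    return segments
```

Each returned tuple holds the segment index, the start and end times in milliseconds, and the text, which is a convenient form for slicing the matching audio or for building forced-alignment training data.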