Common Voice Spontaneous Speech 4.0 - Turkish

Türkçe — Turkish (`tr`)

This datasheet is for sps-corpus-4.0-2026-06-12 of the Mozilla Common Voice Spontaneous Speech dataset for Turkish [Türkçe - tr]. The dataset contains 52 clips representing 0.28 hours of recorded speech (0.19 hours validated) from 12 speakers.

Language

Turkish is the most widely spoken of the Turkic languages, with around 100 million L1 speakers, which makes it the 18th most spoken language in the world. It is the national language of Turkey, one of the two official languages of Cyprus, and a secondary language in some neighboring countries. Many smaller Turkish-speaking groups exist in other countries, either through migration or as communities dating from the Ottoman era. These smaller groups should usually be categorized as variants.

Variants

There are currently no variants defined for the Common Voice Turkish dataset. Note that, until now, this dataset has focused on literary Turkish, often called "Turkish of Turkey". There are also some L2 voices, mostly from immigrants to the country, but these can be categorized as "foreign accents".

Data splits for modelling

The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.

Audio clips

Bucket	Clips	%
Transcribed & Validated	35	67.3%
Transcribed & Pending	8	15.4%
Not transcribed	9	17.3%

Training splits

Bucket	Clips	%
Train	0	0.0%
Dev	0	0.0%
Test	0	0.0%
Unassigned	52	100.0%

Training split coverage: 0 of 35 transcribed & validated clips (0.0%)
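The percentages in the tables above can be reproduced from the clip counts. The following is a minimal sketch, not part of the dataset tooling: the counts are copied from this datasheet, and the helper name `bucket_shares` is a hypothetical choice for illustration.

```python
# Counts copied from the "Audio clips" table of this datasheet.
audio_buckets = {
    "Transcribed & Validated": 35,
    "Transcribed & Pending": 8,
    "Not transcribed": 9,
}


def bucket_shares(buckets):
    """Return each bucket's share of the total, as a percentage rounded to one decimal."""
    total = sum(buckets.values())
    return {name: round(100 * count / total, 1) for name, count in buckets.items()}


shares = bucket_shares(audio_buckets)
for name, pct in shares.items():
    print(f"{name}\t{audio_buckets[name]}\t{pct}%")

# Training split coverage: clips assigned to train, dev, or test,
# divided by the number of transcribed & validated clips.
assigned = 0 + 0 + 0  # Train + Dev + Test, from the "Training splits" table
coverage = round(100 * assigned / audio_buckets["Transcribed & Validated"], 1)
print(f"Training split coverage: {coverage}%")
```

Since no clip has been assigned to a split yet, anyone training on this release would have to define their own partition of the 35 validated clips.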

Transcriptions

Transcription status

Bucket	Clips	%
Validated	35	81.4%
Pending	8	18.6%
Edited	12	27.9%

Writing system

Turkish uses an extended Latin alphabet.

Symbol table

Official Alphabet:

Lowercase: a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z
Uppercase: A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z

Auxiliary Characters (Arabic/Farsi loanwords): â î û Â Î Û
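When working with transcriptions, it can help to check text against this symbol table and to handle the dotted and dotless i correctly, since Python's default `str.lower()` does not apply Turkish casing rules. The sketch below is an illustration only: the helper names `turkish_lower` and `unexpected_letters` are hypothetical, and the letter sets are copied from the table above.

```python
import unicodedata

# Letter sets copied from the symbol table above.
LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
AUXILIARY = "âîûÂÎÛ"
ALPHABET = set(LOWER + UPPER + AUXILIARY)


def turkish_lower(text):
    """Lowercase with Turkish casing: I -> ı and İ -> i, then the default rules."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def unexpected_letters(text):
    """Return the sorted letters in `text` that are not in the symbol table.

    The text is normalised to NFC first so that letters such as ç or ğ are
    compared as single code points. Digits, spaces, and punctuation are ignored.
    """
    normalised = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalised if ch.isalpha() and ch not in ALPHABET})


print(turkish_lower("IŞIK İZMİR"))
print(unexpected_letters("Mantı yapmak isterim."))
print(unexpected_letters("wifi ve taksi"))
```

Letters such as q, w, and x are not part of the official alphabet, so this check will report them even though they can appear in loanwords and proper nouns; whether that should count as an error is a project decision, not something this sketch settles.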

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

Yüzme biliyor musun? Nerede öğrendin? Kim öğretti?
Farklı uçlarda düşünceleri olan insanlar sence ortak paydada buluşabilir mi?
Bir süper gücün olsa, ne olmasını isterdin?
İş hayatında en çok değer verdiğin şey nedir?
Bir gün boyunca tek bir aktivite yapabilecek olsan ne yapardın?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Tabi yeni çıkan bazı terimlerin karşılıkları hemen olmuyor ama zaman içinde üretiliyor. Fakat [disfluency] çok fazla insan [disfluency] şey yapıyor [disfluency] böyle yabancı dillerle karışık konuşmanın iyi birşey olduğunu sanarak, bunu bilerek kullanıyor yani biraz daha böyle [disfluency] iyi ... iyice öz türkçe olmak zorunda değil ama yani Türkçe karşılıkları var birçoğunun var önerilmiş, insanların dilini takip etmesi lazım.
Herhalde buna yanıtım, baklava olmalı, bir yığın çeşidi var, hepsini seviyorum. Aslında bütün tatlıları seviyorum, çünkü şekerim var yiyemiyorum, yoksa bir oturuşta bir kilo yerim.
Açıkçası Avrupa Birliği'nin tam üyesi olmak konusunda ne düşündüğümü sorarsanız, eee, yani şimdi Türkiye yani Türkiye tam üye olsa tabii ki iyi tarafları olacak ama işte ekonomik özgürlüğümüz biraz kısıtlanacak. Yani devletin ekonomiye yönetme hakkı biraz kısıtlanacak euroya geçtiğimiz için. Ama diğer yandan bakarsak yani iyi yönden bakarsak ülke değişimi, ülkeyi değişim derken yani mesela başka bir ülkeye gitmenin Avrupa Birliği'nin kolaylaşması işte gibi avantajlarda bulunmakta
Bazen severim, duruma göre değişir.
Mantı yapmak isterim. Önce hamuru tutarım, sonra iç harcını hazırlarım, soğan ve kıymayla. Ondan sonra oturup uzun uzun açıp, kesip kapatırım, sonra tepsiye dizip kuruturum birazcık, sonra da dondururum, veya yerim.

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker [1]
gender - gender of the speaker [1]
language - language name
split - for data modelling, the subset of the data that this clip belongs to
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated assessments of the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters of transcription per second of audio
- speech-rate - more than 30 characters of transcription per second of audio
- short-audio - audio length under 2 seconds
- long-audio - audio length over 5 minutes
- non-allowed-script - transcription contains characters from a writing system not associated with the language
- mixed-script-words - a single word contains characters from multiple writing systems
- mixed-script-transcription - transcription spans multiple writing systems, but each word consistently uses only one
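The field list above can be exercised with a short script that reads tab-separated rows, splits `quality_tags` on `|`, and recomputes the length and duration tags from the stated thresholds. This is a sketch under assumptions, not the official tagging code: the sample rows are made up, characters per second is assumed to be the transcription length divided by the duration in seconds, the files are assumed to be unquoted TSV, and the function name `length_and_rate_tags` is hypothetical. The script-related tags are not covered.

```python
import csv
import io

# Hypothetical two-row sample with a subset of the columns listed above.
# These rows are made up for illustration; they are not real corpus data.
SAMPLE_TSV = (
    "audio_id\tduration_ms\ttranscription\tquality_tags\n"
    "1\t4000\tBazen severim, duruma göre değişir.\t\n"
    "2\t1500\tYok.\ttranscription-length|short-audio\n"
)


def length_and_rate_tags(transcription, duration_ms):
    """Return (characters per second, tags) using the thresholds in the field list.

    Assumption: characters per second = len(transcription) / duration in seconds.
    """
    seconds = duration_ms / 1000
    cps = len(transcription) / seconds if seconds > 0 else 0.0
    tags = []
    if cps < 3:
        tags.append("transcription-length")
    if cps > 30:
        tags.append("speech-rate")
    if seconds < 2:
        tags.append("short-audio")
    if seconds > 5 * 60:
        tags.append("long-audio")
    return cps, tags


# QUOTE_NONE treats quotation marks inside transcriptions as ordinary characters.
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    # An empty quality_tags field means no tags were assigned.
    stored = {tag for tag in row["quality_tags"].split("|") if tag}
    cps, computed = length_and_rate_tags(row["transcription"], int(row["duration_ms"]))
    print(row["audio_id"], round(cps, 2), sorted(stored), set(computed) == stored)
```

To run this against a real release, replace `io.StringIO(SAMPLE_TSV)` with an open file handle (`open(path, encoding="utf-8", newline="")`) pointing at the TSV file from the download.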

Get involved

Community links

Main Channels:

Social media channels used during campaigns:

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

1. For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
