Common Voice Scripted Speech 25.0 - Czech

Čeština — Czech (`cs`)

This datasheet describes release cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Czech [Čeština - cs]. The dataset contains 216,875 clips representing 268.7 hours of recorded speech (81.09 hours validated) from 1,134 speakers, recorded from a text corpus of 451,358 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		24,609 (11.3%)	155 (13.7%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	115,868 (53.4%)	385 (34.0%)
female_feminine	Female, feminine	47,471 (21.9%)	50 (4.4%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	53,536 (24.7%)	811 (71.5%)

Gender declared: 163,339 of 216,875 clips (75.3%), 323 of 1,134 speakers (28.5%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	5,328 (2.5%)	47 (4.1%)
twenties	Twenties	40,351 (18.6%)	157 (13.8%)
thirties	Thirties	101,274 (46.7%)	154 (13.6%)
fourties	Forties	15,289 (7.0%)	75 (6.6%)
fifties	Fifties	2,675 (1.2%)	23 (2.0%)
sixties	Sixties	314 (0.1%)	5 (0.4%)
seventies	Seventies	20 (0.0%)	1 (0.1%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	51,624 (23.8%)	798 (70.4%)

Age declared: 165,251 of 216,875 clips (76.2%), 336 of 1,134 speakers (29.6%)
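
The coverage lines under the Gender and Age tables are consistent with taking each total and subtracting the Unspecified row. The following is a minimal sketch of that arithmetic, using figures copied from the tables above. The helper name `coverage` is hypothetical, and the calculation is inferred from the published numbers rather than taken from the tooling that generated this datasheet.

```python
def coverage(total: int, unspecified: int) -> tuple[int, float]:
    """Return (declared count, declared share in percent, one decimal)."""
    declared = total - unspecified
    return declared, round(100 * declared / total, 1)


# Totals and Unspecified rows copied from the Gender and Age tables.
print(coverage(216_875, 53_536))  # gender, clips    -> (163339, 75.3)
print(coverage(1_134, 811))       # gender, speakers -> (323, 28.5)
print(coverage(216_875, 51_624))  # age, clips       -> (165251, 76.2)
print(coverage(1_134, 798))       # age, speakers    -> (336, 29.6)
```

Note that the per-category speaker counts in each table can add up to more than the declared total, so the coverage figure cannot be rebuilt by summing the individual rows.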

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	65,452 (30.2%)
Invalidated	2,665 (1.2%)
Other	148,758 (68.6%)

Training splits

Split	Clips
Train	22,269 (34.0%)
Dev	9,473 (14.5%)
Test	9,463 (14.5%)

Training split coverage: 41,205 of 65,452 validated clips (63.0%)

The dataset contains 65,452 validated, 2,665 invalidated, and 148,758 unresolved (Other) clips. The average clip duration is 4.46 seconds.
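
As a rough cross-check, the hour totals quoted in the introduction can be approximated from the clip counts and the average clip duration. This sketch assumes the 4.46-second average applies equally to all clips and to the validated subset, which the datasheet does not state. The function name `total_hours` is hypothetical.

```python
def total_hours(clips: int, avg_seconds: float) -> float:
    """Estimate total duration in hours from a clip count and mean length."""
    return clips * avg_seconds / 3600


print(round(total_hours(216_875, 4.46), 1))  # all clips       -> 268.7
print(round(total_hours(65_452, 4.46), 2))   # validated clips -> 81.09
```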

Text corpus

Validated sentences: 451,253

Category	Count
Unvalidated sentences	105
Pending sentences	82
Rejected sentences	23
Reported sentences	966

The corpus contains 451,358 sentences: 451,253 validated and 105 unvalidated (82 pending review, 23 rejected), with 966 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Zde byl vyhlášen nejlepším brankářem ligy.
Viděla tři postavy na blízkém kopci na koni s bajonetem.
Dopravní spojení s centrem města a ostatními městskými částmi zajišťuje Dopravní podnik města Košice.
Měli dva páry ploutví a živili se rybami a hlavonožci.
Závěr filmu je zcela odlišný od knižní předlohy.

Sources

Source	Sentences
wiki	342,013 (75.8%)
europarl-v7-cs	98,821 (21.9%)
sentence-collector	9,834 (2.2%)
Other	585 (0.1%)

Text domains

Code	Domain	Clips	Speakers
general	General	15 (0.0%)	10 (0.9%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	1 (0.0%)	1 (0.1%)
finance	Finance	-	-
service_retail	Service and Retail	-	-
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	1 (0.0%)	1 (0.1%)
media_entertainment	Media and Entertainment	4 (0.0%)	4 (0.4%)
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	-	-
technology_robotics	Technology and Robotics	8 (0.0%)	7 (0.6%)
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
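
The field list above can be read with Python's standard `csv` module. The sketch below parses a tab-separated clips file and counts rows per declared gender. The two sample rows are synthetic placeholders (not taken from the dataset) and show only a subset of the columns. Treating an empty cell as "Unspecified" is an assumption about how undeclared values are stored.

```python
import csv
import io
from collections import Counter


def read_clips(handle):
    """Parse a clips TSV (tab-separated, with a header row) into dicts."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def count_by(rows, column):
    """Count rows per value of `column`; empty cells become 'Unspecified'."""
    return Counter(row[column] or "Unspecified" for row in rows)


# Synthetic two-row sample for illustration only, NOT real dataset rows.
SAMPLE = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\n"
    "abc\tclip_1.mp3\tAhoj.\t2\t0\tthirties\tmale_masculine\n"
    "def\tclip_2.mp3\tDobrý den.\t0\t2\t\t\n"
)

rows = read_clips(io.StringIO(SAMPLE))
print(count_by(rows, "gender"))
```

For a real file, pass `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object. `QUOTE_NONE` is used so that quotation marks inside sentences are kept as literal characters.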

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
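
A per-source breakdown like the one in the Sources table could be rebuilt from the `source` column of this file. The following is a minimal sketch under that assumption. The function name `source_shares` is hypothetical, and the sample rows are synthetic and cover only three of the columns.

```python
import csv
import io
from collections import Counter


def source_shares(handle):
    """Map each source to (sentence count, share of all rows in percent)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["source"] for row in reader)
    total = sum(counts.values())
    return {src: (n, round(100 * n / total, 1)) for src, n in counts.most_common()}


# Synthetic rows for illustration only, NOT real corpus sentences.
SENTENCES = (
    "sentence_id\tsentence\tsource\n"
    "s1\tPrvní věta.\twiki\n"
    "s2\tDruhá věta.\twiki\n"
    "s3\tTřetí věta.\twiki\n"
    "s4\tČtvrtá věta.\teuroparl-v7-cs\n"
)

shares = source_shares(io.StringIO(SENTENCES))
print(shares)
```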

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
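
Because this file carries vote counts and a status, it can be convenient to load it into typed records. The sketch below converts the vote columns to integers and separates pending from rejected sentences. The class name `UnvalidatedSentence`, the subset of columns used, and the treatment of empty vote cells as 0 are all assumptions made for illustration, and the sample rows are synthetic.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class UnvalidatedSentence:
    sentence_id: str
    sentence: str
    up_votes: int
    down_votes: int
    status: str  # "pending" or "rejected", per the field list above


def parse_unvalidated(handle):
    """Parse unvalidated-sentence rows into typed records."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        UnvalidatedSentence(
            sentence_id=row["sentence_id"],
            sentence=row["sentence"],
            up_votes=int(row["up_votes"] or 0),      # assumption: empty means 0
            down_votes=int(row["down_votes"] or 0),  # assumption: empty means 0
            status=row["status"],
        )
        for row in reader
    ]


# Synthetic rows for illustration only, NOT real corpus sentences.
UNVALIDATED = (
    "sentence_id\tsentence\tup_votes\tdown_votes\tstatus\n"
    "u1\tNová věta.\t1\t0\tpending\n"
    "u2\tŠpatná věta.\t0\t2\trejected\n"
)

records = parse_unvalidated(io.StringIO(UNVALIDATED))
pending = [r for r in records if r.status == "pending"]
rejected = [r for r in records if r.status == "rejected"]
print(len(pending), len(rejected))
```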

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
