Common Voice Scripted Speech 25.0 - Polish

polski — Polish (`pl`)

This datasheet describes release cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Polish [polski - pl]. The dataset contains 148,690 clips representing 188.84 hours of recorded speech (176.61 hours validated) from 3,465 speakers, recorded from a text corpus of 253,879 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		6,606 (4.4%)	77 (2.2%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	86,337 (58.1%)	699 (20.2%)
female_feminine	Female, feminine	20,283 (13.6%)	89 (2.6%)
transgender	Transgender	15 (0.0%)	1 (0.0%)
non-binary	Non-binary	152 (0.1%)	1 (0.0%)
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	41,903 (28.2%)	2,755 (79.5%)

Gender declared: 106,787 of 148,690 clips (71.8%), 710 of 3,465 speakers (20.5%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	2,974 (2.0%)	59 (1.7%)
twenties	Twenties	40,158 (27.0%)	382 (11.0%)
thirties	Thirties	51,108 (34.4%)	278 (8.0%)
fourties	Forties	16,513 (11.1%)	78 (2.3%)
fifties	Fifties	924 (0.6%)	12 (0.3%)
sixties	Sixties	151 (0.1%)	7 (0.2%)
seventies	Seventies	5 (0.0%)	1 (0.0%)
eighties	Eighties	-	-
nineties	Nineties	790 (0.5%)	2 (0.1%)
-	Unspecified	36,067 (24.3%)	2,739 (79.0%)

Age declared: 112,623 of 148,690 clips (75.7%), 726 of 3,465 speakers (21.0%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	139,066 (93.5%)
Invalidated	6,991 (4.7%)
Other	2,633 (1.8%)

Training splits

Split	Clips
Train	25,458 (18.3%)
Dev	10,063 (7.2%)
Test	10,063 (7.2%)

Training split coverage: 45,584 of 139,066 validated clips (32.8%)

The dataset contains 139,066 validated, 6,991 invalidated, and 2,633 unresolved clips. The average clip duration is 4.572 seconds.
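The clip counts and the average clip duration can be cross-checked against the hour totals quoted at the top of this datasheet. The sketch below is only an arithmetic check: the constants are copied from this datasheet, the function name is illustrative, and because the average duration is rounded the results are approximate.

```python
# Figures copied from this datasheet.
TOTAL_CLIPS = 148_690
VALIDATED_CLIPS = 139_066
AVG_CLIP_SECONDS = 4.572  # rounded average, so the results are approximate


def clips_to_hours(clips: int, avg_seconds: float = AVG_CLIP_SECONDS) -> float:
    """Estimate total duration in hours from a clip count."""
    return clips * avg_seconds / 3600


total_hours = clips_to_hours(TOTAL_CLIPS)
validated_hours = clips_to_hours(VALIDATED_CLIPS)
print(f"total: {total_hours:.2f} h, validated: {validated_hours:.2f} h")
```

Both estimates should land within rounding distance of the reported 188.84 and 176.61 hours.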

Text corpus

Validated sentences: 230,282

Category	Count
Unvalidated sentences	23,597
Pending sentences	23,548
Rejected sentences	49
Reported sentences	603

The corpus contains 253,879 sentences: 230,282 validated and 23,597 unvalidated (23,548 pending review, 49 rejected), with 603 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

do niedawna kraje europejskie sprzedawały broń tym, którzy teraz nagle okazali się być dyktatorami
Komisja Prawna również zgłosiła tę uwagę
W Białorusi ciągle jeszcze są wyroki śmierci i ta kara śmierci jest stosowana
Fundusze gwarantowane nie istnieją i definicja ta powinna zostać usunięta z systemu
Poprzestanę na tym jednym pytaniu

Sources

Source	Sentences
selected-europarl-v7-pl	204,981 (89.0%)
sentence-collector	23,254 (10.1%)
Other	2,047 (0.9%)

Text domains

Code	Domain	Clips	Speakers
general	General	4 (0.0%)	4 (0.1%)
agriculture_food	Agriculture and Food	3 (0.0%)	1 (0.0%)
automotive_transport	Automotive and Transport	-	-
finance	Finance	13 (0.0%)	11 (0.3%)
service_retail	Service and Retail	-	-
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	-	-
media_entertainment	Media and Entertainment	-	-
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	2 (0.0%)	2 (0.1%)
technology_robotics	Technology and Robotics	4 (0.0%)	3 (0.1%)
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
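As a minimal sketch of how such a file might be read, the example below parses tab-separated rows with Python's standard csv module. It assumes the file has a header row using the field names listed above; the two sample rows, the client_id values, the file names, and the vote-based filter are hypothetical illustrations, not the rule Common Voice uses to validate clips.

```python
import csv
import io

# Hypothetical two-row sample. A real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
SAMPLE_TSV = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\n"
    "abc\tclip_1.mp3\tPoprzestanę na tym jednym pytaniu\t2\t0\n"
    "def\tclip_2.mp3\tKomisja Prawna również zgłosiła tę uwagę\t0\t2\n"
)


def read_clips(handle):
    """Yield one dict per clip row, with vote counts converted to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        yield row


def more_up_than_down(rows):
    """Illustrative filter only: keep clips with more up votes than down votes."""
    return [r for r in rows if r["up_votes"] > r["down_votes"]]


clips = list(read_clips(io.StringIO(SAMPLE_TSV)))
kept = more_up_than_down(clips)
print([c["path"] for c in kept])
```

Reading with csv.QUOTE_NONE is a deliberate choice here, since sentence text may contain quotation marks that should not be treated as field delimiters.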

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
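A per-source breakdown like the one in the Sources table could be derived from this file along the following lines. This is a sketch under assumptions: the header row is taken to match the field names above, the sample rows are hypothetical (including the 1/0 encoding shown for is_used, which the code does not interpret), and the function name is illustrative.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the columns listed above.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tPoprzestanę na tym jednym pytaniu\t\t\tselected-europarl-v7-pl\t1\t3\n"
    "s2\tKomisja Prawna również zgłosiła tę uwagę\t\t\tselected-europarl-v7-pl\t1\t0\n"
    "s3\tHypothetical sentence\t\tgeneral\tsentence-collector\t0\t1\n"
)


def summarise_sources(handle):
    """Return (sentences per source, recorded clips per source)."""
    sentences = Counter()
    clips = Counter()
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        sentences[row["source"]] += 1
        clips[row["source"]] += int(row["clips_count"] or 0)
    return sentences, clips


sentences, clips = summarise_sources(io.StringIO(SAMPLE_TSV))
total = sum(sentences.values())
for source, count in sentences.most_common():
    # Share of validated sentences contributed by each source.
    print(f"{source}\t{count} ({count / total:.1%})\tclips: {clips[source]}")
```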

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
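The pending and rejected counts reported in the Text corpus section correspond to a tally over the status column. The sketch below shows one way to compute such a tally; it assumes the status values are the literal strings "pending" and "rejected", and the sample rows and source labels are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the columns listed above.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\tHypothetical sentence one\t\t\texample-source\t1\t0\tpending\n"
    "u2\tHypothetical sentence two\t\t\texample-source\t0\t2\trejected\n"
    "u3\tHypothetical sentence three\t\t\texample-source\t0\t0\tpending\n"
)


def count_by_status(handle):
    """Tally unvalidated sentences by their status value."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)


status_counts = count_by_status(io.StringIO(SAMPLE_TSV))
unvalidated_total = sum(status_counts.values())
print(dict(status_counts), "total:", unvalidated_total)
```

On the full file, the per-status counts would be expected to add up to the unvalidated total, as 23,548 pending and 49 rejected add up to 23,597 in this release.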

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These values are reported only if the speaker opted in to provide them.
