Common Voice Scripted Speech 25.0 - Belarusian

Беларуская — Belarusian (`be`)

This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Belarusian [Беларуская - be]. The dataset contains 1,419,713 clips representing 1,890.19 hours of recorded speech (1,816.01 hours validated) from 8,604 speakers, recorded from a text corpus of 381,479 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		827 (0.1%)	15 (0.2%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	81,495 (5.7%)	516 (6.0%)
female_feminine	Female, feminine	102,005 (7.2%)	724 (8.4%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	40 (0.0%)	1 (0.0%)
-	Unspecified	1,236,173 (87.1%)	8,088 (94.0%)

Gender declared: 183,540 of 1,419,713 clips (12.9%), 516 of 8,604 speakers (6.0%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	10,698 (0.8%)	84 (1.0%)
twenties	Twenties	48,758 (3.4%)	540 (6.3%)
thirties	Thirties	61,914 (4.4%)	479 (5.6%)
fourties	Forties	54,488 (3.8%)	127 (1.5%)
fifties	Fifties	1,323 (0.1%)	22 (0.3%)
sixties	Sixties	1,239 (0.1%)	8 (0.1%)
seventies	Seventies	103 (0.0%)	3 (0.0%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	1,241,190 (87.4%)	8,074 (93.8%)

Age declared: 178,523 of 1,419,713 clips (12.6%), 530 of 8,604 speakers (6.2%)
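
The clip figure in the coverage summary can be reproduced from the table rows. The snippet below is a minimal sketch that copies the clip counts from the Age table; the variable names are illustrative and not part of the dataset.

```python
# Clip counts copied from the Age table above (declared buckets only).
age_clips = {
    "teens": 10_698,
    "twenties": 48_758,
    "thirties": 61_914,
    "fourties": 54_488,  # code spelled as it appears in the dataset
    "fifties": 1_323,
    "sixties": 1_239,
    "seventies": 103,
}
total_clips = 1_419_713
unspecified_clips = 1_241_190

# Clips with a declared age, and their share of all clips.
declared = sum(age_clips.values())
coverage_pct = round(100 * declared / total_clips, 1)

print(f"Age declared: {declared:,} of {total_clips:,} clips ({coverage_pct}%)")
```

The same calculation applies to the Gender table by swapping in its clip counts.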

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	1,363,996 (96.1%)
Invalidated	36,985 (2.6%)
Other	18,732 (1.3%)

Training splits

Split	Clips
Train	347,710 (25.5%)
Dev	15,879 (1.2%)
Test	15,875 (1.2%)

Training split coverage: 379,464 of 1,363,996 validated clips (27.8%)

The dataset contains 1,363,996 validated, 36,985 invalidated, and 18,732 unresolved (Other) clips. The average clip duration is 4.793 seconds.
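
The figures in this section can be cross-checked with a few lines of arithmetic. This is a sketch using only numbers quoted in this datasheet; the average duration is derived from the rounded total of 1,890.19 hours, so it is approximate.

```python
# Clip buckets and training splits, copied from the tables above.
validated, invalidated, other = 1_363_996, 36_985, 18_732
train, dev, test = 347_710, 15_879, 15_875
total_hours = 1890.19  # total recorded hours, from the introduction

total_clips = validated + invalidated + other
split_total = train + dev + test

# Share of validated clips that appear in train, dev, or test.
split_coverage_pct = round(100 * split_total / validated, 1)

# Average clip duration in seconds, from total hours and total clips.
avg_clip_seconds = round(total_hours * 3600 / total_clips, 3)

print(total_clips, split_total, split_coverage_pct, avg_clip_seconds)
```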

Text corpus

Validated sentences: 379,562

Category	Count
Unvalidated sentences	1,917
Pending sentences	1,768
Rejected sentences	149
Reported sentences	3,217

The corpus contains 381,479 sentences: 379,562 validated and 1,917 unvalidated (1,768 pending review, 149 rejected), with 3,217 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Можа, прыйшоў час усё растлумачыць?
У імя гэтага ідэалу ўвесь гэты час ідзе змаганне.
За гэты час ніхто не з'явіўся.
Возера зімой большай часткай пакрыта снегам і лёдам.
Тады межы былі цалкам перакрытыя.

Sources

Source	Sentences
euroradio-sampled-sentences-non-blacklisted	97,751 (25.8%)
wiki	85,295 (22.5%)
knihi	69,546 (18.3%)
novychas-sampled-sentences-non-blacklisted	52,619 (13.9%)
nashaniva-sampled-sentences-non-blacklisted	37,042 (9.8%)
euroradio-sampled-sentences-non-blacklisted-batch2	18,692 (4.9%)
novychas-sampled-sentences-non-blacklisted-batch2	17,218 (4.5%)
Other	1,399 (0.4%)
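
The percentages in the Sources table appear to be shares of the validated sentence count, since the per-source counts add up to 379,562. The sketch below recomputes them from the table rows; the variable names are illustrative.

```python
# Sentence counts copied from the Sources table above.
source_counts = {
    "euroradio-sampled-sentences-non-blacklisted": 97_751,
    "wiki": 85_295,
    "knihi": 69_546,
    "novychas-sampled-sentences-non-blacklisted": 52_619,
    "nashaniva-sampled-sentences-non-blacklisted": 37_042,
    "euroradio-sampled-sentences-non-blacklisted-batch2": 18_692,
    "novychas-sampled-sentences-non-blacklisted-batch2": 17_218,
    "Other": 1_399,
}
validated_sentences = 379_562

total = sum(source_counts.values())
# Percentage share of each source, rounded to one decimal place.
shares = {name: round(100 * n / validated_sentences, 1)
          for name, n in source_counts.items()}

for name, pct in shares.items():
    print(f"{name}\t{pct}%")
```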

Text domains

Code	Domain	Clips	Speakers
general	General	33 (0.0%)	13 (0.2%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	-	-
finance	Finance	-	-
service_retail	Service and Retail	4 (0.0%)	4 (0.0%)
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	48 (0.0%)	19 (0.2%)
media_entertainment	Media and Entertainment	-	-
nature_environment	Nature and Environment	4 (0.0%)	4 (0.0%)
news_current_affairs	News and Current Affairs	8 (0.0%)	6 (0.1%)
technology_robotics	Technology and Robotics	8 (0.0%)	7 (0.1%)
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
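
As a minimal sketch of reading these fields, the example below parses a small in-memory TSV with Python's `csv` module. The column names come from the list above; the two rows, the `client_id` and `path` values, the value formats, and the `read_clips` helper are hypothetical. The vote comparison is only an illustrative filter, not the dataset's validation rule.

```python
import csv
import io

CLIP_FIELDS = [
    "client_id", "path", "text", "up_votes", "down_votes", "age", "gender",
    "accents", "variant", "segment", "prompt_upvotes", "prompt_reports",
    "is_edited",
]

# Hypothetical rows for illustration; real values will differ.
rows = [
    ("hypothetical_hash_a", "clip_0001.mp3", "За гэты час ніхто не з'явіўся.",
     "2", "0", "twenties", "female_feminine", "", "", "", "1", "0", "false"),
    ("hypothetical_hash_b", "clip_0002.mp3", "Тады межы былі цалкам перакрытыя.",
     "0", "2", "", "", "", "", "", "1", "0", "false"),
]
sample_tsv = "\n".join(["\t".join(CLIP_FIELDS)] + ["\t".join(r) for r in rows]) + "\n"


def read_clips(handle):
    """Yield one dict per clip row, with vote counts converted to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row


clips = list(read_clips(io.StringIO(sample_tsv)))

# Clips with more up-votes than down-votes (illustrative filter only).
more_up_than_down = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
# Clips whose speaker opted in to declaring an age (empty string otherwise).
declared_age = [c["path"] for c in clips if c["age"]]

print(more_up_than_down, declared_age)
```

To read a real file, pass an open file handle (for example `open(path, encoding="utf-8", newline="")`) to `read_clips` instead of the `io.StringIO` sample.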

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
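
A short sketch of aggregating this file by source follows. Only the column names are taken from the list above; the three rows, their identifiers, the `is_used` values, and the assignment of sentences to sources are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the header matches the field list above.
sample_tsv = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tМожа, прыйшоў час усё растлумачыць?\t\t\twiki\t1\t3\n"
    "s2\tВозера зімой большай часткай пакрыта снегам і лёдам.\t\t"
    "nature_environment\twiki\t1\t5\n"
    "s3\tТады межы былі цалкам перакрытыя.\t\t\tknihi\t0\t2\n"
)

sentences_per_source = Counter()
clips_per_source = Counter()

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
for row in reader:
    sentences_per_source[row["source"]] += 1
    clips_per_source[row["source"]] += int(row["clips_count"])

print(dict(sentences_per_source), dict(clips_per_source))
```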

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
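
The `status` column can be tallied the same way. The sketch below counts sentences per status and flags any value other than the two named above; all rows and the `example-source` name are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only the header matches the field list above.
sample_tsv = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\t"
    "up_votes\tdown_votes\tstatus\n"
    "u1\tHypothetical sentence one.\t\t\texample-source\t1\t0\tpending\n"
    "u2\tHypothetical sentence two.\t\t\texample-source\t0\t2\trejected\n"
    "u3\tHypothetical sentence three.\t\t\texample-source\t0\t0\tpending\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t",
                           quoting=csv.QUOTE_NONE))

# Number of sentences per status value.
status_counts = Counter(row["status"] for row in rows)

# Sentence IDs whose status is neither of the two documented values.
unexpected = [row["sentence_id"] for row in rows
              if row["status"] not in {"pending", "rejected"}]

print(dict(status_counts), unexpected)
```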

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These are only reported if the speaker opted in to provide that information.
