Common Voice Scripted Speech 25.0 - Bashkir

Башҡорт — Bashkir (`ba`)

This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Bashkir [Башҡорт - ba]. The dataset contains 218,518 clips representing 268.71 hours of recorded speech (258.81 hours validated) from 930 speakers, recorded from a text corpus of 153,973 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		613 (0.3%)	5 (0.5%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	66,257 (30.3%)	119 (12.8%)
female_feminine	Female, feminine	86,091 (39.4%)	209 (22.5%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	66,170 (30.3%)	782 (84.1%)

Gender declared: 152,348 of 218,518 clips (69.7%), 148 of 930 speakers (15.9%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	9,292 (4.3%)	47 (5.1%)
twenties	Twenties	38,086 (17.4%)	128 (13.8%)
thirties	Thirties	37,883 (17.3%)	91 (9.8%)
fourties	Forties	13,047 (6.0%)	35 (3.8%)
fifties	Fifties	11,608 (5.3%)	13 (1.4%)
sixties	Sixties	42,367 (19.4%)	12 (1.3%)
seventies	Seventies	40 (0.0%)	1 (0.1%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	66,195 (30.3%)	783 (84.2%)

Age declared: 152,323 of 218,518 clips (69.7%), 147 of 930 speakers (15.8%)
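The coverage line above can be reproduced from the table itself. The following is a minimal sketch that copies the clip counts from the Age table and recomputes the declared total, the unspecified remainder, and the coverage percentage. The variable names are illustrative only and are not part of the dataset.

```python
# Clip counts copied from the Age table above (keys are the dataset's own codes).
age_clips = {
    "teens": 9_292,
    "twenties": 38_086,
    "thirties": 37_883,
    "fourties": 13_047,
    "fifties": 11_608,
    "sixties": 42_367,
    "seventies": 40,
}
total_clips = 218_518  # total clips in the release, from the summary paragraph

declared = sum(age_clips.values())          # clips with a declared age
unspecified = total_clips - declared        # clips without a declared age
coverage = round(100 * declared / total_clips, 1)

print(f"Age declared: {declared:,} of {total_clips:,} clips ({coverage}%)")
print(f"Unspecified: {unspecified:,}")
```

The same check applies to the Gender table by swapping in its clip counts.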

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	210,463 (96.3%)
Invalidated	8,012 (3.7%)
Other	43 (0.0%)

Training splits

Split	Clips
Train	119,133 (56.6%)
Dev	14,527 (6.9%)
Test	14,572 (6.9%)

Training split coverage: 148,232 of 210,463 validated clips (70.4%)

The dataset contains 210,463 validated, 8,012 invalidated, and 43 unresolved clips. The average clip duration is 4.427 seconds.
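As a quick consistency check, the split coverage and the average clip duration can be derived from the figures reported in this datasheet. This is a sketch under the assumption that the average is taken over all 218,518 clips and the 268.71 recorded hours; because the hour figure is rounded, the result is only expected to agree to about three decimal places.

```python
# Figures copied from this datasheet.
validated_clips = 210_463
splits = {"train": 119_133, "dev": 14_527, "test": 14_572}
total_hours = 268.71
total_clips = 218_518

# Clips assigned to any training split, and their share of validated clips.
in_splits = sum(splits.values())
split_coverage = round(100 * in_splits / validated_clips, 1)

# Average clip duration in seconds (assumes all clips, not only validated ones).
avg_seconds = total_hours * 3600 / total_clips

print(f"Training split coverage: {in_splits:,} ({split_coverage}%)")
print(f"Average clip duration: {avg_seconds:.3f} s")
```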

Text corpus

Validated sentences: 153,968

Category	Count
Unvalidated sentences	5
Pending sentences	-
Rejected sentences	5
Reported sentences	864

The corpus contains 153,973 sentences: 153,968 validated and 5 unvalidated (0 pending review, 5 rejected), with 864 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Һау булмаһам, һау итәңме?
Иң мөһиме — тармаҡтың һәр бер йүнәлешендә лә үҫеш бар.
Һабала эшләнгән ҡымыҙ хайран тәмле була.
Һине ерләргә.
—Уға беренсе фигураны күрһәтәйек.

Sources

Source	Sentences
sentence-collector	153,957 (100.0%)
Other	11 (0.0%)

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
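A minimal sketch of reading a clips file with Python's standard `csv` module follows. It assumes the file has a header row whose column names match the field list above; the inline sample rows, the client IDs, and the file names in them are hypothetical and exist only to make the example runnable without the dataset.

```python
import csv
import io

def read_clips(handle):
    """Yield one dict per clip row from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Vote counts arrive as strings; convert them for comparisons.
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row

# Hypothetical two-row sample using a subset of the columns listed above.
SAMPLE = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\n"
    "abc123\tclip_0001.mp3\tҺине ерләргә.\t2\t0\ttwenties\tfemale_feminine\n"
    "def456\tclip_0002.mp3\tҺау булмаһам, һау итәңме?\t0\t2\t\t\n"
)

clips = list(read_clips(io.StringIO(SAMPLE)))

# Clips with more up-votes than down-votes.
approved = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
# Clips whose speaker declared an age (empty string means not declared).
declared_age = [c["path"] for c in clips if c["age"]]

print(approved, declared_age)
```

To read a real file, open it with `encoding="utf-8"` and `newline=""` and pass the file object to `read_clips` in place of the `io.StringIO` sample.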

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
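The fields above lend themselves to simple corpus summaries. The sketch below counts sentences per source and lists sentences that have no recordings yet. It assumes a header row matching the field names above; the sample rows, the `s1`/`s2` identifiers, and the `is_used` values are hypothetical.

```python
import csv
import io
from collections import Counter

def summarise_sentences(handle):
    """Return (sentences per source, ids of sentences with zero clips)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    by_source = Counter()
    unrecorded = []
    for row in reader:
        by_source[row["source"]] += 1
        if int(row["clips_count"]) == 0:
            unrecorded.append(row["sentence_id"])
    return by_source, unrecorded

# Hypothetical sample; ids and is_used values are placeholders.
SAMPLE = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tҺине ерләргә.\t\t\tsentence-collector\t1\t3\n"
    "s2\tҺабала эшләнгән ҡымыҙ хайран тәмле була.\t\t\tsentence-collector\t1\t0\n"
)

by_source, unrecorded = summarise_sentences(io.StringIO(SAMPLE))
print(dict(by_source), unrecorded)
```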

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
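Since `status` takes the values pending or rejected, a tally by status is a natural first check on this file. The following sketch assumes a header row matching the field names above and that the status values are spelled exactly `pending` and `rejected`; the sample rows and identifiers are hypothetical.

```python
import csv
import io
from collections import Counter

def tally_status(handle):
    """Count unvalidated sentences by their status column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)

# Hypothetical sample; ids, sentences, and vote counts are placeholders.
SAMPLE = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\texample one\t\t\tsentence-collector\t0\t2\trejected\n"
    "u2\texample two\t\t\tsentence-collector\t1\t0\tpending\n"
    "u3\texample three\t\t\tsentence-collector\t0\t2\trejected\n"
)

status_counts = tally_status(io.StringIO(SAMPLE))
print(status_counts["pending"], status_counts["rejected"])
```

For this release the datasheet reports 0 pending and 5 rejected sentences, so a tally over the real file would be expected to show only the rejected status.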

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
