Common Voice Scripted Speech 26.0 - Uzbek

O‘zbek — Uzbek (`uz`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Uzbek [O‘zbek - uz]. The dataset contains 230,183 clips representing 266 hours of recorded speech (101.18 hours validated) from 2,309 speakers, recorded from a text corpus of 286,627 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		3,744 (1.6%)	91 (3.9%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	101,772 (44.2%)	632 (27.4%)
female_feminine	Female, feminine	34,271 (14.9%)	55 (2.4%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	94,140 (40.9%)	1,965 (85.1%)

Gender declared: 136,043 of 230,183 clips (59.1%), 344 of 2,309 speakers (14.9%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	40,799 (17.7%)	181 (7.8%)
twenties	Twenties	91,455 (39.7%)	462 (20.0%)
thirties	Thirties	1,716 (0.7%)	67 (2.9%)
fourties	Forties	1,486 (0.6%)	11 (0.5%)
fifties	Fifties	80 (0.0%)	2 (0.1%)
sixties	Sixties	-	-
seventies	Seventies	-	-
eighties	Eighties	-	-
nineties	Nineties	5 (0.0%)	1 (0.0%)
-	Unspecified	94,642 (41.1%)	1,931 (83.6%)

Age declared: 135,541 of 230,183 clips (58.9%), 378 of 2,309 speakers (16.4%)
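
The coverage summaries under the gender and age tables follow from subtracting the Unspecified row from the dataset totals. A minimal Python sketch of that arithmetic, using the figures from the tables above (the function name `declared_coverage` is illustrative and not part of any Common Voice tooling):

```python
def declared_coverage(unspecified: int, total: int) -> tuple[int, float]:
    """Return (declared count, declared share in percent to one decimal)."""
    declared = total - unspecified
    return declared, round(100 * declared / total, 1)

TOTAL_CLIPS = 230_183
TOTAL_SPEAKERS = 2_309

# Gender: the Unspecified row lists 94,140 clips and 1,965 speakers.
print(declared_coverage(94_140, TOTAL_CLIPS))     # (136043, 59.1)
print(declared_coverage(1_965, TOTAL_SPEAKERS))   # (344, 14.9)

# Age: the Unspecified row lists 94,642 clips and 1,931 speakers.
print(declared_coverage(94_642, TOTAL_CLIPS))     # (135541, 58.9)
print(declared_coverage(1_931, TOTAL_SPEAKERS))   # (378, 16.4)
```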

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	87,557 (38.0%)
Invalidated	14,183 (6.2%)
Other	128,443 (55.8%)

Training splits

Split	Clips
Train	49,033 (56.0%)
Dev	12,315 (14.1%)
Test	12,405 (14.2%)

Training split coverage: 73,753 of 87,557 validated clips (84.2%)

The dataset contains 87,557 validated, 14,183 invalidated, and 128,443 unresolved clips. The average clip duration is 4.16 seconds.
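
The bucket shares, split coverage, and hour totals above can be cross-checked from the clip counts. The sketch below assumes the reported 4.16-second average applies equally to the full set and to the validated subset; because that average is itself rounded, the results are approximate.

```python
SECONDS_PER_HOUR = 3600
AVG_CLIP_SECONDS = 4.16  # reported average clip duration (rounded)

buckets = {"validated": 87_557, "invalidated": 14_183, "other": 128_443}
splits = {"train": 49_033, "dev": 12_315, "test": 12_405}

total_clips = sum(buckets.values())
split_clips = sum(splits.values())

def pct(part: int, whole: int) -> float:
    """Share of `part` in `whole`, in percent to one decimal."""
    return round(100 * part / whole, 1)

def hours(clips: int) -> float:
    """Estimated duration in hours, assuming the average clip length."""
    return clips * AVG_CLIP_SECONDS / SECONDS_PER_HOUR

print(total_clips)                                          # 230183
print({name: pct(n, total_clips) for name, n in buckets.items()})
print(split_clips, pct(split_clips, buckets["validated"]))  # 73753 84.2
print(round(hours(total_clips)), round(hours(buckets["validated"]), 2))  # 266 101.18
```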

Text corpus

Validated sentences: 148,738

Category	Count
Unvalidated sentences	137,889
Pending sentences	137,872
Rejected sentences	17
Reported sentences	1,816

The corpus contains 286,627 sentences: 148,738 validated and 137,889 unvalidated (137,872 pending review, 17 rejected), with 1,816 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

• Qaynona bir vaqtlar o‘zi ham kelin bo‘lganini unutadi-da, zulm qiladi.
• Yurgan daryo, o‘tirgan bo‘yra...
• Tupuging og‘zingga qaytib tushmasin.
• Ey, ustalar, nega diqqat bo‘layotibsizlar, nima gap
• Tuman, shahar hamda viloyat darajasida qayta ko‘rib chiqish zarur

Sources

Source	Sentences
sentence-collector	148,606 (99.9%)
Other	132 (0.1%)

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
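
As an illustration of the row layout above, the following Python sketch parses a clips file and keeps the rows whose up-votes exceed their down-votes. It assumes the file starts with a header row using the field names listed above. The sample rows (including the `.mp3` file names) are invented for the example, and the `more_up_than_down` helper is a hypothetical filter, not the rule Common Voice uses to validate clips.

```python
import csv
import io

# Column names taken from the field list above.
CLIP_FIELDS = [
    "client_id", "path", "sentence", "sentence_id", "sentence_domain",
    "up_votes", "down_votes", "age", "gender", "accents", "variant",
    "locale", "segment",
]

def read_clips(handle):
    """Yield one dict per clip row from a tab-separated file with a header."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row

def more_up_than_down(rows):
    """Hypothetical filter: keep clips with more up-votes than down-votes."""
    return [r for r in rows if r["up_votes"] > r["down_votes"]]

# Invented sample rows; a real file would be opened with
# open(path, encoding="utf-8", newline="").
sample_rows = [
    ["abc", "clip_1.mp3", "Tupuging og‘zingga qaytib tushmasin.", "s1", "",
     "2", "0", "twenties", "male_masculine", "", "", "uz", ""],
    ["def", "clip_2.mp3", "Yurgan daryo, o‘tirgan bo‘yra...", "s2", "",
     "0", "2", "", "", "", "", "uz", ""],
]
sample = "\n".join("\t".join(r) for r in [CLIP_FIELDS, *sample_rows]) + "\n"

kept = more_up_than_down(read_clips(io.StringIO(sample)))
print([r["path"] for r in kept])  # ['clip_1.mp3']
```

Empty strings in `age`, `gender`, and `accents` correspond to speakers who did not opt in to provide that information.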

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
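
One use of `clips_count` is to find validated sentences that have not been recorded yet. A minimal Python sketch, assuming the file has a header row with the column names listed above. The sample rows are invented, and the `is_used` values shown are placeholders, since this datasheet does not specify how that column is encoded.

```python
import csv
import io

def unrecorded_sentences(handle):
    """Return the sentence_id of every row whose clips_count is zero."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row["sentence_id"] for row in reader if int(row["clips_count"]) == 0]

header = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "is_used", "clips_count"]
# Invented rows for illustration only.
rows = [
    ["s1", "Tupuging og‘zingga qaytib tushmasin.", "", "", "sentence-collector", "1", "3"],
    ["s2", "Yurgan daryo, o‘tirgan bo‘yra...", "", "", "sentence-collector", "1", "0"],
]
sample = "\n".join("\t".join(r) for r in [header, *rows]) + "\n"

print(unrecorded_sentences(io.StringIO(sample)))  # ['s2']
```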

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
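
The `status` column makes it possible to tally pending and rejected sentences. The Python sketch below assumes a header row with the column names listed above and uses invented placeholder rows. Run against the full file, the tally would be expected to match the 137,872 pending and 17 rejected sentences reported in the Text corpus section, though that has not been verified here.

```python
import csv
import io
from collections import Counter

def status_counts(handle):
    """Count rows per value of the status column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)

header = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "up_votes", "down_votes", "status"]
# Invented rows for illustration only.
rows = [
    ["u1", "placeholder sentence one", "", "", "sentence-collector", "1", "0", "pending"],
    ["u2", "placeholder sentence two", "", "", "sentence-collector", "0", "0", "pending"],
    ["u3", "placeholder sentence three", "", "", "sentence-collector", "0", "2", "rejected"],
]
sample = "\n".join("\t".join(r) for r in [header, *rows]) + "\n"

counts = status_counts(io.StringIO(sample))
print(counts["pending"], counts["rejected"])  # 2 1
```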

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
