Common Voice Scripted Speech 26.0 - Dargwa

Common Voice Scripted Speech 26.0 - Dargwa | Mozilla Data Collective

Дарган — Dargwa (`dar`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Dargwa [Дарган - dar]. The dataset contains 12,616 clips representing 19.91 hours of recorded speech (15.34 hours validated) from 44 speakers, recorded from a text corpus of 7,088 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		911 (7.2%)	4 (9.1%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	-	-
female_feminine	Female, feminine	4,835 (38.3%)	7 (15.9%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	7,781 (61.7%)	41 (93.2%)

Gender declared: 4,835 of 12,616 clips (38.3%), 3 of 44 speakers (6.8%)
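
A table of this shape could be derived from the clip rows along the following lines. This is a minimal sketch, not the tooling used to build the datasheet: the `tally_by_field` helper and the sample rows are hypothetical, and it assumes that an empty `gender` cell means the speaker did not declare one.

```python
from collections import defaultdict

def tally_by_field(rows, field):
    """Count clips and distinct speakers per value of `field`.

    Assumption: an empty value means "not declared" and is reported
    under the key "Unspecified".
    """
    clips = defaultdict(int)
    speakers = defaultdict(set)
    for row in rows:
        key = row.get(field) or "Unspecified"
        clips[key] += 1
        speakers[key].add(row["client_id"])
    return {key: (clips[key], len(speakers[key])) for key in clips}

# Hypothetical rows; real rows would come from the clip TSV files.
sample_rows = [
    {"client_id": "a", "gender": "female_feminine"},
    {"client_id": "a", "gender": "female_feminine"},
    {"client_id": "b", "gender": ""},
    {"client_id": "c", "gender": ""},
]

summary = tally_by_field(sample_rows, "gender")
total = len(sample_rows)
for code, (n_clips, n_speakers) in summary.items():
    print(f"{code}: {n_clips} clips ({n_clips / total:.1%}), {n_speakers} speaker(s)")
```

The same helper applies to the `age` field by passing `"age"` as the second argument.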

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	158 (1.3%)	2 (4.5%)
twenties	Twenties	3,509 (27.8%)	10 (22.7%)
thirties	Thirties	70 (0.6%)	1 (2.3%)
fourties	Forties	1,986 (15.7%)	1 (2.3%)
fifties	Fifties	4,300 (34.1%)	1 (2.3%)
sixties	Sixties	-	-
seventies	Seventies	-	-
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	2,593 (20.6%)	37 (84.1%)

Age declared: 10,023 of 12,616 clips (79.4%), 7 of 44 speakers (15.9%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	9,717 (77.0%)
Invalidated	188 (1.5%)
Other	2,711 (21.5%)

Training splits

Split	Clips
Train	2,096 (21.6%)
Dev	1,616 (16.6%)
Test	1,431 (14.7%)

Training split coverage: 5,143 of 9,717 validated clips (52.9%)

The dataset contains 9,717 validated, 188 invalidated, and 2,711 unresolved (Other) clips. The average clip duration is 5.684 seconds.
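
The percentages in the bucket and split tables follow from the clip counts. The short check below uses only figures stated in this section (the variable names are illustrative). It shows that bucket percentages are relative to all clips, whereas split percentages are relative to validated clips.

```python
# Clip counts taken from the tables in this section.
total_clips = 12_616
validated = 9_717
splits = {"train": 2_096, "dev": 1_616, "test": 1_431}

# Bucket percentages are relative to all clips.
validated_share = validated / total_clips

# Split percentages are relative to validated clips only.
split_shares = {name: count / validated for name, count in splits.items()}
in_splits = sum(splits.values())
coverage = in_splits / validated

print(f"validated: {validated_share:.1%}")
for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
print(f"split coverage: {in_splits:,} of {validated:,} ({coverage:.1%})")
```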

Text corpus

Validated sentences: 6,443

Category	Count
Unvalidated sentences	645
Pending sentences	499
Rejected sentences	146
Reported sentences	52

The corpus contains 7,088 sentences: 6,443 validated and 645 unvalidated (499 pending review, 146 rejected), with 52 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Сеналра асухӀебирар.
Душмантала кьасра ца сабри: Москвализи сабаэс!
Мурт гӀеркӀусив?
Гьанбиркахъис ца ишгъуна анцӀбукь.
Лебилра цалабикибтас СягӀид Амировли чӀумаси арадеш, талихӀ, эркиндеш ва сархибдешуни далгун.

Sources

Source	Sentences
Газета Замана	2,435 (37.8%)
ТУРЦИЯЛА ЗАКЛИУБ. Ибрагьим ИБРАГЬИМОВ	722 (11.2%)
Common phrases from daily spoken Dargwa	485 (7.5%)
Гьалад делкӏунти сагати хабурти	352 (5.5%)
Даргинская народная мудрость	289 (4.5%)
ТУРЦИЯЛА ЗАКЛИУБ. Ибрагьим Ибрагьимов	275 (4.3%)
Хабар «ГуглахӀяй»	214 (3.3%)
Бухъна махьила лягӏнат. Ибрагьим Ибрагьимов	177 (2.7%)
«ВиштӀаси талхъан». Антуан Сент-Экзюпери	170 (2.6%)
Other	1,324 (20.5%)

Text domains

Code	Domain	Clips	Speakers
general	General	9,906 (78.5%)	44 (100.0%)
agriculture_food	Agriculture and Food	466 (3.7%)	16 (36.4%)
automotive_transport	Automotive and Transport	-	-
finance	Finance	-	-
service_retail	Service and Retail	-	-
healthcare	Healthcare	126 (1.0%)	10 (22.7%)
history_law_government	History, Law and Government	1,598 (12.7%)	23 (52.3%)
media_entertainment	Media and Entertainment	4,691 (37.2%)	34 (77.3%)
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	3,871 (30.7%)	31 (70.5%)
technology_robotics	Technology and Robotics	-	-
language_fundamentals	Language Fundamentals	211 (1.7%)	10 (22.7%)

Fields

Clips

Each row of a clip TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
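
A clip file with these fields can be read with Python's standard `csv` module. The sketch below is illustrative only: it parses an in-memory sample with invented values (including the file names) instead of a real file, and it assumes tab-delimited rows, a header line naming the fields, and unquoted cells.

```python
import csv
import io

# Hypothetical two-row sample using a subset of the documented columns.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\tlocale\n"
    "abc123\tclip_0001.mp3\tМурт гӀеркӀусив?\t2\t0\ttwenties\t\tdar\n"
    "def456\tclip_0002.mp3\tСеналра асухӀебирар.\t1\t1\t\t\tdar\n"
)

def read_clips(handle):
    """Yield one dict per clip, converting vote counts to integers."""
    # QUOTE_NONE treats quotation marks inside sentences as literal text
    # (assumption: the files do not use CSV-style quoting).
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row

clips = list(read_clips(io.StringIO(SAMPLE_TSV)))
for clip in clips:
    # Empty demographic cells are shown as "Unspecified", as in the tables.
    print(clip["path"], clip["up_votes"], clip["down_votes"], clip["age"] or "Unspecified")
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the TSV was extracted.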

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
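
As an illustration of how these columns can be combined, the sketch below totals `clips_count` per `source` and lists sentences that have no recordings yet. The rows, identifiers, and counts are invented for the example. The `is_used` column is left as a raw string because its encoding is not specified here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; identifiers and counts are invented.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tМурт гӀеркӀусив?\t\tgeneral\tГазета Замана\t1\t3\n"
    "s2\tСеналра асухӀебирар.\t\tgeneral\tГазета Замана\t1\t0\n"
    "s3\tГьанбиркахъис ца ишгъуна анцӀбукь.\t\t\tХабар «ГуглахӀяй»\t0\t2\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Total number of recorded clips per sentence source.
clips_per_source = Counter()
for row in rows:
    clips_per_source[row["source"]] += int(row["clips_count"])

# Sentences that have not been recorded at all.
unrecorded = [row["sentence_id"] for row in rows if int(row["clips_count"]) == 0]

for source, count in clips_per_source.most_common():
    print(f"{source}: {count} clips")
print("no clips yet:", unrecorded)
```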

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
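
The `status` column allows unvalidated sentences to be split into the pending and rejected groups reported in the text corpus table. A minimal sketch with invented rows follows. The literal values `pending` and `rejected` are taken from the field description above, and their exact spelling in the file is an assumption.

```python
from collections import Counter

# Hypothetical rows reduced to the columns used here.
rows = [
    {"sentence_id": "u1", "up_votes": "1", "down_votes": "0", "status": "pending"},
    {"sentence_id": "u2", "up_votes": "0", "down_votes": "2", "status": "rejected"},
    {"sentence_id": "u3", "up_votes": "0", "down_votes": "0", "status": "pending"},
]

# Tally sentences by review status.
status_counts = Counter(row["status"] for row in rows)
print(status_counts["pending"], "pending,", status_counts["rejected"], "rejected")
```

For this release the corresponding figures are 499 pending and 146 rejected, which together make up the 645 unvalidated sentences.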

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
