Common Voice Scripted Speech 25.0 - Armenian

Հայերեն — Armenian (`hy-AM`)

This datasheet describes release cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Armenian [Հայերեն - hy-AM]. The dataset contains 38,398 clips representing 57.41 hours of recorded speech (34.31 hours validated) from 586 speakers, recorded from a text corpus of 238,796 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		8,815 (23.0%)	64 (10.9%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	9,238 (24.1%)	75 (12.8%)
female_feminine	Female, feminine	20,852 (54.3%)	115 (19.6%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	8,308 (21.6%)	438 (74.7%)

Gender declared: 30,090 of 38,398 clips (78.4%), 148 of 586 speakers (25.3%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	4,086 (10.6%)	25 (4.3%)
twenties	Twenties	23,584 (61.4%)	146 (24.9%)
thirties	Thirties	2,840 (7.4%)	26 (4.4%)
fourties	Forties	1,429 (3.7%)	8 (1.4%)
fifties	Fifties	265 (0.7%)	3 (0.5%)
sixties	Sixties	-	-
seventies	Seventies	-	-
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	6,194 (16.1%)	422 (72.0%)

Age declared: 32,204 of 38,398 clips (83.9%), 164 of 586 speakers (28.0%)
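
The coverage summaries under both tables can be recomputed from the Unspecified rows. The sketch below assumes that "declared" means the total minus the Unspecified count. The speaker rows need not sum to the declared total (for gender, 75 + 115 = 190, not 148), so the declared speaker count is derived from the Unspecified row rather than by adding up categories. The function name is illustrative.

```python
def declared_coverage(total: int, unspecified: int) -> tuple[int, float]:
    """Return (declared count, declared share in percent, one decimal)."""
    declared = total - unspecified
    return declared, round(100 * declared / total, 1)


# Totals and Unspecified counts copied from the tables above.
TOTAL_CLIPS, TOTAL_SPEAKERS = 38_398, 586

gender_clips = declared_coverage(TOTAL_CLIPS, 8_308)
gender_speakers = declared_coverage(TOTAL_SPEAKERS, 438)
age_clips = declared_coverage(TOTAL_CLIPS, 6_194)
age_speakers = declared_coverage(TOTAL_SPEAKERS, 422)

print(gender_clips, gender_speakers, age_clips, age_speakers)
```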

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	22,953 (59.8%)
Invalidated	1,399 (3.6%)
Other	14,046 (36.6%)

Training splits

Split	Clips
Train	10,501 (45.8%)
Dev	5,968 (26.0%)
Test	6,245 (27.2%)

Training split coverage: 22,714 of 22,953 validated clips (99.0%)
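
The split percentages and the coverage figure appear to use the validated clip count as the denominator. A minimal check under that assumption, with the counts copied from the tables above:

```python
VALIDATED_CLIPS = 22_953
splits = {"train": 10_501, "dev": 5_968, "test": 6_245}

# Clips assigned to any of the three splits.
assigned = sum(splits.values())

# Share of validated clips in each split, in percent.
shares = {name: round(100 * count / VALIDATED_CLIPS, 1) for name, count in splits.items()}

coverage = round(100 * assigned / VALIDATED_CLIPS, 1)
unassigned = VALIDATED_CLIPS - assigned  # validated clips outside train/dev/test

print(assigned, shares, coverage, unassigned)
```

Under this reading, 239 validated clips are not assigned to any of the three splits.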

The dataset contains 22,953 validated, 1,399 invalidated, and 14,046 unresolved clips. The average clip duration is 5.383 seconds.
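
The reported hours can be roughly reproduced from the clip counts and the average clip duration. This is only an estimate: it assumes the 5.383-second average applies equally to all clips and to the validated subset, which the datasheet does not state, so small differences from the published 57.41 and 34.31 hours are expected.

```python
AVG_CLIP_SECONDS = 5.383  # average clip duration from the datasheet


def estimated_hours(clips: int, avg_seconds: float = AVG_CLIP_SECONDS) -> float:
    """Estimate total duration in hours from a clip count and a mean duration."""
    return clips * avg_seconds / 3600


total_hours = estimated_hours(38_398)      # all clips
validated_hours = estimated_hours(22_953)  # validated clips only

print(f"{total_hours:.2f} h total, {validated_hours:.2f} h validated")
```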

Text corpus

Validated sentences: 237,419

Category	Count
Unvalidated sentences	1,377
Pending sentences	1,349
Rejected sentences	28
Reported sentences	184

The corpus contains 238,796 sentences: 237,419 validated and 1,377 unvalidated (1,349 pending review, 28 rejected), with 184 reported for review.

Sample

The following five sentences were randomly selected from the corpus.

Մարմնավորվում է մարտնչող ռազմիկի տեսքով՝ նստած թռչող այծի վրա՝ գահավորակը ձեռքին։
Ցենտրիոլների և ցենտրոսֆերայի ամբողջությունը կոչվում է բջջային կենտրոն։
Հոկտեմբեր կը համարուի անցողիկ ամիս՝ անձրեւային եւ ցուրտ ժամանակաշրջաններու միջեւ։
Քույրերի միջև վեճ է սկսվում։
ասում են՝ դա մահվան նշան է…

Sources

Source	Sentences
hy_wiki	216,997 (91.4%)
ՊԱՅՔԱՐ 1898	2,874 (1.2%)
Զազունյան 1890	2,652 (1.1%)
Other	14,896 (6.3%)
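
The source counts add up to 237,419, which is the number of validated sentences rather than the full corpus of 238,796, and the percentages match that denominator. The following sketch recomputes them from the table; the variable names are illustrative.

```python
# Sentence counts per source, copied from the table above.
sources = {
    "hy_wiki": 216_997,
    "ՊԱՅՔԱՐ 1898": 2_874,
    "Զազունյան 1890": 2_652,
    "Other": 14_896,
}

VALIDATED_SENTENCES = 237_419

total = sum(sources.values())

# Percentage of validated sentences contributed by each source.
shares = {name: round(100 * count / VALIDATED_SENTENCES, 1) for name, count in sources.items()}

print(total == VALIDATED_SENTENCES, shares)
```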

Text domains

Code	Domain	Clips	Speakers
general	General	1 (0.0%)	1 (0.2%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	-	-
finance	Finance	1 (0.0%)	1 (0.2%)
service_retail	Service and Retail	-	-
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	-	-
media_entertainment	Media and Entertainment	-	-
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	-	-
technology_robotics	Technology and Robotics	-	-
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said the audio matches the text
down_votes - number of people who said the audio does not match the text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
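
As a rough illustration of how these fields might be read, the sketch below parses tab-separated clip rows with Python's standard `csv` module and counts clips per declared gender. It assumes the file has a header row using the field names listed above and that fields are not quoted. The two sample rows are invented for the example and do not come from the dataset.

```python
import csv
import io
from collections import Counter


def read_clips(handle):
    """Yield one dict per clip row from a tab-separated file object."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Vote counts arrive as text; convert them so they can be compared.
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        yield row


def gender_counts(rows):
    """Count clips per declared gender; empty values become 'unspecified'."""
    return Counter(row["gender"] or "unspecified" for row in rows)


# Hypothetical sample with a subset of the columns (not real dataset rows).
sample = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\n"
    "abc123\tclip_0001.mp3\texample one\t2\t0\ttwenties\tfemale_feminine\n"
    "def456\tclip_0002.mp3\texample two\t2\t1\t\t\n"
)

rows = list(read_clips(io.StringIO(sample)))
counts = gender_counts(rows)
print(counts)
```

For a real file, the `io.StringIO(sample)` argument would be replaced by a file opened with `encoding="utf-8"` and `newline=""`.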

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
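
One possible use of these fields is finding validated sentences that have no recordings yet. The sketch below assumes `clips_count` is stored as an integer string and that the file has a header row with the field names above. The sample rows, including the sentence identifiers and the `1` used for `is_used`, are hypothetical.

```python
import csv
import io


def unrecorded_sentences(handle):
    """Return the sentence_id of every row whose clips_count is zero."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row["sentence_id"] for row in reader if int(row["clips_count"] or 0) == 0]


# Hypothetical rows for illustration only.
sample = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tfirst example\t\t\thy_wiki\t1\t3\n"
    "s2\tsecond example\t\t\thy_wiki\t1\t0\n"
)

missing = unrecorded_sentences(io.StringIO(sample))
print(missing)
```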

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
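
The `status` field can be tallied to cross-check the Text corpus table, which reports 1,349 pending and 28 rejected sentences for this release. A minimal sketch, assuming a header row with the field names above and unquoted tab-separated values; the sample rows are invented.

```python
import csv
import io
from collections import Counter


def status_counts(handle):
    """Count unvalidated sentences per status value."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)


# Hypothetical rows for illustration only.
sample = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\tfirst example\t\t\thy_wiki\t1\t0\tpending\n"
    "u2\tsecond example\t\t\thy_wiki\t0\t2\trejected\n"
    "u3\tthird example\t\t\thy_wiki\t0\t0\tpending\n"
)

tally = status_counts(io.StringIO(sample))
print(tally)
```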

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
