Common Voice Scripted Speech 26.0 - Persian

فارسی — Persian (`fa`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Persian (فارسی, `fa`). The dataset contains 394,397 clips representing 430.86 hours of recorded speech (373.24 hours validated) from 4,660 speakers, recorded from a text corpus of 354,627 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		91,066 (23.1%)	114 (2.4%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	275,796 (69.9%)	1,262 (27.1%)
female_feminine	Female, feminine	25,280 (6.4%)	254 (5.5%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	93,321 (23.7%)	3,247 (69.7%)

Gender declared: 301,076 of 394,397 clips (76.3%), 1,413 of 4,660 speakers (30.3%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	11,601 (2.9%)	132 (2.8%)
twenties	Twenties	129,859 (32.9%)	849 (18.2%)
thirties	Thirties	140,062 (35.5%)	479 (10.3%)
fourties	Forties	9,840 (2.5%)	88 (1.9%)
fifties	Fifties	5,745 (1.5%)	24 (0.5%)
sixties	Sixties	181 (0.0%)	4 (0.1%)
seventies	Seventies	36 (0.0%)	1 (0.0%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	97,073 (24.6%)	3,199 (68.6%)

Age declared: 297,324 of 394,397 clips (75.4%), 1,461 of 4,660 speakers (31.4%)
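The coverage line above can be reproduced from the age table. The following is a minimal sketch that uses only the clip counts quoted in this datasheet; the variable names are illustrative and not part of any Common Voice tooling.

```python
# Clip counts per declared age band, copied from the age table above.
# Keys are the dataset's own codes (note the code is spelled "fourties").
age_clips = {
    "teens": 11_601,
    "twenties": 129_859,
    "thirties": 140_062,
    "fourties": 9_840,
    "fifties": 5_745,
    "sixties": 181,
    "seventies": 36,
}
total_clips = 394_397

# Clips whose speaker declared an age, and the remainder listed as Unspecified.
declared_clips = sum(age_clips.values())
unspecified_clips = total_clips - declared_clips

# Coverage as a percentage, rounded to one decimal place as in the datasheet.
age_coverage = round(100 * declared_clips / total_clips, 1)
```

The same calculation does not apply to the speaker columns: the per-band speaker counts in the table add up to more than the declared-speaker total, so only the clip counts are summed here.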

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	341,657 (86.6%)
Invalidated	15,500 (3.9%)
Other	37,240 (9.4%)

Training splits

Split	Clips
Train	30,385 (8.9%)
Dev	10,752 (3.1%)
Test	10,752 (3.1%)

Training split coverage: 51,889 of 341,657 validated clips (15.2%)

The dataset contains 341,657 validated, 15,500 invalidated, and 37,240 unresolved clips. The average clip duration is 3.933 seconds.
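The figures in this section are related by simple arithmetic, which can be a useful sanity check after downloading a release. The sketch below uses only numbers quoted in this datasheet and assumes the average duration is total recorded time divided by total clip count; the variable names are illustrative.

```python
# Clip counts quoted in the bucket and split tables above.
validated, invalidated, other = 341_657, 15_500, 37_240
train, dev, test = 30_385, 10_752, 10_752
total_hours = 430.86  # total recorded speech, from the datasheet summary

total_clips = validated + invalidated + other
split_clips = train + dev + test

# Share of validated clips that appear in a train/dev/test split.
split_coverage = round(100 * split_clips / validated, 1)

# Mean clip duration in seconds (assumed: total seconds / total clips).
avg_duration_s = round(total_hours * 3600 / total_clips, 3)
```

Under these assumptions the computed values agree with the 15.2% split coverage and the 3.933 second average reported above.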

Text corpus

Validated sentences: 58,854

Category	Count
Unvalidated sentences	295,773
Pending sentences	294,668
Rejected sentences	1,105
Reported sentences	3,370

The corpus contains 354,627 sentences: 58,854 validated and 295,773 unvalidated (294,668 pending review, 1,105 rejected), with 3,370 reported for review.

Sample

The following five sentences were selected at random from the corpus.

پیشاپیش از همراهی و حضور شما سپاسگزاریم.
نامیبیا
او توپ را به طرف دروازه شوت کرد.
زرپرستی کردن
این رمان در سه سطح ساخته شده است.

Sources

Source	Sentences
sentence-collector	44,442 (81.0%)
self-prepared sentences	1,961 (3.6%)
self-prepared vocabulary and sentence list	1,562 (2.8%)
Other	6,931 (12.6%)

Text domains

Code	Domain	Clips	Speakers
general	General	52 (0.0%)	19 (0.4%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	11 (0.0%)	6 (0.1%)
finance	Finance	4 (0.0%)	2 (0.0%)
service_retail	Service and Retail	-	-
healthcare	Healthcare	2 (0.0%)	1 (0.0%)
history_law_government	History, Law and Government	5 (0.0%)	1 (0.0%)
media_entertainment	Media and Entertainment	-	-
nature_environment	Nature and Environment	9 (0.0%)	4 (0.1%)
news_current_affairs	News and Current Affairs	-	-
technology_robotics	Technology and Robotics	10 (0.0%)	2 (0.0%)
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
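As an illustration of the column layout listed above, here is a minimal sketch that parses a clips TSV with Python's standard `csv` module. The two rows are made up for the example (the `client_id`, `path`, and `sentence_id` values are hypothetical, not real dataset rows), and the filter shown is illustrative rather than the project's validation rule.

```python
import csv
import io

COLUMNS = [
    "client_id", "path", "sentence", "sentence_id", "sentence_domain",
    "up_votes", "down_votes", "age", "gender", "accents", "variant",
    "locale", "segment",
]

# Hypothetical rows in the documented column order; empty strings stand for
# fields the speaker did not provide.
ROWS = [
    ["abc123", "clip_0001.mp3", "نامیبیا", "s1", "", "2", "0",
     "twenties", "male_masculine", "", "", "fa", ""],
    ["def456", "clip_0002.mp3", "زرپرستی کردن", "s2", "", "0", "2",
     "", "", "", "", "fa", ""],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [COLUMNS] + ROWS) + "\n"


def read_clips(handle):
    """Parse a clips TSV into a list of dicts, casting vote counts to int."""
    # QUOTE_NONE treats quote characters inside sentences as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        rows.append(row)
    return rows


clips = read_clips(io.StringIO(SAMPLE_TSV))

# Illustrative filters: clips with more up votes than down votes, and clips
# whose speaker declared an age.
upvoted = [c for c in clips if c["up_votes"] > c["down_votes"]]
with_age = [c for c in clips if c["age"]]
```

To read a real file, the same function could be given an open file handle, for example `open(tsv_path, encoding="utf-8", newline="")`, where `tsv_path` points at one of the release's clip TSV files.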

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
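The `status` column makes it straightforward to tally pending and rejected sentences. The sketch below assumes the column order listed above and uses three made-up rows (the identifiers and sentence texts are hypothetical); on the real file the tallies would be expected to match the pending and rejected counts in the text corpus table.

```python
import csv
import io
from collections import Counter

COLUMNS = [
    "sentence_id", "sentence", "variant", "sentence_domain",
    "source", "up_votes", "down_votes", "status",
]

# Hypothetical rows for illustration only.
ROWS = [
    ["u1", "example sentence one", "", "", "sentence-collector", "1", "0", "pending"],
    ["u2", "example sentence two", "", "", "sentence-collector", "0", "2", "rejected"],
    ["u3", "example sentence three", "", "general", "sentence-collector", "0", "0", "pending"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [COLUMNS] + ROWS) + "\n"


def count_statuses(handle):
    """Return a Counter mapping each status value to its number of rows."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)


status_counts = count_statuses(io.StringIO(SAMPLE_TSV))
```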

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
