Common Voice Scripted Speech 25.0 - Chinese (China)

汉语（中国大陆） — Chinese (China) (`zh-CN`)

This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Chinese (China) [汉语（中国大陆） - zh-CN]. The dataset contains 851,641 clips representing 1,073.77 hours of recorded speech (239.14 hours validated) from 7,546 speakers, recorded from a text corpus of 60,052 sentences.

Language

Accents

Code	Accent	Clips	Speakers
110000	出生地：11 北京市	7,174 (0.8%)	128 (1.7%)
360000	出生地：36 江西省	5,638 (0.7%)	41 (0.5%)
440000	出生地：44 广东省	4,621 (0.5%)	94 (1.2%)
230000	出生地：23 黑龙江省	3,971 (0.5%)	45 (0.6%)
320000	出生地：32 江苏省	3,913 (0.5%)	95 (1.3%)
370000	出生地：37 山东省	3,701 (0.4%)	98 (1.3%)
310000	出生地：31 上海市	3,526 (0.4%)	59 (0.8%)
330000	出生地：33 浙江省	2,917 (0.3%)	87 (1.2%)
210000	出生地：21 辽宁省	2,904 (0.3%)	42 (0.6%)
120000	出生地：12 天津市	2,889 (0.3%)	19 (0.3%)
510000	出生地：51 四川省	2,698 (0.3%)	74 (1.0%)
410000	出生地：41 河南省	2,275 (0.3%)	68 (0.9%)
130000	出生地：13 河北省	1,930 (0.2%)	58 (0.8%)
350000	出生地：35 福建省	1,810 (0.2%)	36 (0.5%)
420000	出生地：42 湖北省	1,792 (0.2%)	54 (0.7%)
450000	出生地：45 广西壮族自治区	1,685 (0.2%)	24 (0.3%)
340000	出生地：34 安徽省	1,411 (0.2%)	42 (0.6%)
500000	出生地：50 重庆市	1,404 (0.2%)	21 (0.3%)
140000	出生地：14 山西省	1,391 (0.2%)	30 (0.4%)
430000	出生地：43 湖南省	1,053 (0.1%)	51 (0.7%)
610000	出生地：61 陕西省	685 (0.1%)	37 (0.5%)
220000	出生地：22 吉林省	534 (0.1%)	24 (0.3%)
640000	出生地：64 宁夏回族自治区	414 (0.0%)	6 (0.1%)
650000	出生地：65 新疆维吾尔自治区	355 (0.0%)	18 (0.2%)
460000	出生地：46 海南省	331 (0.0%)	2 (0.0%)
150000	出生地：15 内蒙古自治区	277 (0.0%)	17 (0.2%)
530000	出生地：53 云南省	240 (0.0%)	14 (0.2%)
620000	出生地：62 甘肃省	182 (0.0%)	16 (0.2%)
520000	出生地：52 贵州省	164 (0.0%)	13 (0.2%)
810000	出生地：81 香港特别行政区	113 (0.0%)	4 (0.1%)
710000	出生地：71 台湾省	85 (0.0%)	2 (0.0%)
630000	出生地：63 青海省	5 (0.0%)	1 (0.0%)
540000	出生地：54 西藏自治区	5 (0.0%)	1 (0.0%)
-	Other	3,898 (0.5%)	79 (1.0%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	48,479 (5.7%)	975 (12.9%)
female_feminine	Female, feminine	12,282 (1.4%)	255 (3.4%)
transgender	Transgender	-	-
non-binary	Non-binary	238 (0.0%)	2 (0.0%)
do_not_wish_to_say	Prefer not to say	536 (0.1%)	9 (0.1%)
-	Unspecified	790,106 (92.8%)	6,434 (85.3%)

Gender declared: 61,535 of 851,641 clips (7.2%), 1,112 of 7,546 speakers (14.7%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	11,550 (1.4%)	230 (3.0%)
twenties	Twenties	41,678 (4.9%)	877 (11.6%)
thirties	Thirties	11,719 (1.4%)	192 (2.5%)
fourties	Forties	2,192 (0.3%)	59 (0.8%)
fifties	Fifties	165 (0.0%)	10 (0.1%)
sixties	Sixties	6 (0.0%)	2 (0.0%)
seventies	Seventies	5 (0.0%)	1 (0.0%)
eighties	Eighties	-	-
nineties	Nineties	30 (0.0%)	2 (0.0%)
-	Unspecified	784,296 (92.1%)	6,318 (83.7%)

Age declared: 67,345 of 851,641 clips (7.9%), 1,228 of 7,546 speakers (16.3%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	189,674 (22.3%)
Invalidated	59,226 (7.0%)
Other	602,741 (70.8%)

Training splits

Split	Clips
Train	29,608 (15.6%)
Dev	10,653 (5.6%)
Test	10,653 (5.6%)

Training split coverage: 50,914 of 189,674 validated clips (26.8%)

The dataset contains 189,674 validated, 59,226 invalidated, and 602,741 unresolved clips (the Other bucket). The average clip duration is 4.539 seconds.
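The percentages and hour totals above can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in this datasheet. The helper name `share` is illustrative, and the validated-hours estimate assumes validated clips have the same average duration as the full set, so it is only approximate.

```python
# Figures quoted in this datasheet.
TOTAL_CLIPS = 851_641
BUCKETS = {"validated": 189_674, "invalidated": 59_226, "other": 602_741}
SPLITS = {"train": 29_608, "dev": 10_653, "test": 10_653}
AVG_CLIP_SECONDS = 4.539


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The three buckets partition the full set of clips.
assert sum(BUCKETS.values()) == TOTAL_CLIPS

bucket_shares = {name: share(n, TOTAL_CLIPS) for name, n in BUCKETS.items()}
split_coverage = share(sum(SPLITS.values()), BUCKETS["validated"])

# Approximate durations implied by the (rounded) average clip length.
total_hours = TOTAL_CLIPS * AVG_CLIP_SECONDS / 3600
validated_hours = BUCKETS["validated"] * AVG_CLIP_SECONDS / 3600

print(bucket_shares, split_coverage, round(total_hours, 1), round(validated_hours, 1))
```

Because the average duration is itself rounded, the derived hour totals agree with the reported 1,073.77 and 239.14 hours only to within a small rounding error.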

Text corpus

Validated sentences: 59,143

Category	Count
Unvalidated sentences	909
Pending sentences	22
Rejected sentences	887
Reported sentences	1,145

The corpus contains 60,052 sentences: 59,143 validated and 909 unvalidated (22 pending review, 887 rejected), with 1,145 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

这可能导致真正的社会不平等和不公正。
京沈公路过境。
平谷区长城列表旨在列出中国北京市平谷区的长城墙体及附属设施。
但高速铁路毋须自行驾车会较为舒适。
归入第五批全国重点文物保护单位直波碉楼。

Sources

Source	Sentences
wiki	54,638 (92.4%)
cn	2,881 (4.9%)
Other	1,623 (2.7%)

Text domains

Code	Domain	Clips	Speakers
general	General	890 (0.1%)	153 (2.0%)
agriculture_food	Agriculture and Food	55 (0.0%)	36 (0.5%)
automotive_transport	Automotive and Transport	75 (0.0%)	54 (0.7%)
finance	Finance	111 (0.0%)	53 (0.7%)
service_retail	Service and Retail	58 (0.0%)	44 (0.6%)
healthcare	Healthcare	139 (0.0%)	71 (0.9%)
history_law_government	History, Law and Government	394 (0.0%)	111 (1.5%)
media_entertainment	Media and Entertainment	1,672 (0.2%)	156 (2.1%)
nature_environment	Nature and Environment	63 (0.0%)	42 (0.6%)
news_current_affairs	News and Current Affairs	194 (0.0%)	74 (1.0%)
technology_robotics	Technology and Robotics	249 (0.0%)	88 (1.2%)
language_fundamentals	Language Fundamentals	102 (0.0%)	61 (0.8%)

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - supposed transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
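As a rough illustration of the row layout described above, the following sketch parses a tab-separated clips file with Python's standard `csv` module. The two sample rows, their values, and the function name `read_clips` are made up for the example, and only a subset of the listed columns is included.

```python
import csv
import io

# Hypothetical two-row excerpt using column names from the list above;
# the values are invented for illustration.
SAMPLE_TSV = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\taccents\n"
    "abc123\tclip_0001.mp3\t京沈公路过境。\t2\t0\ttwenties\t\t\n"
    "def456\tclip_0002.mp3\t京沈公路过境。\t0\t2\t\t\t\n"
)


def read_clips(handle):
    """Yield one dict per clip, converting vote counts to integers."""
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row


clips = list(read_clips(io.StringIO(SAMPLE_TSV)))

# Empty demographic fields mean the speaker did not provide that information.
with_age = [c for c in clips if c["age"]]
more_up_than_down = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")` in place of the in-memory sample.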

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
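One possible use of these columns is to count sentences per source and to find sentences that have no recordings yet. The sketch below assumes `clips_count` holds a plain integer and uses invented rows. The `is_used` values shown are placeholders, since this datasheet does not specify how that column is encoded.

```python
import csv
import io
from collections import Counter

# Made-up rows; only the column names come from the list above.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\t京沈公路过境。\t\t\twiki\t1\t3\n"
    "s2\t这可能导致真正的社会不平等和不公正。\t\tgeneral\twiki\t1\t0\n"
    "s3\t但高速铁路毋须自行驾车会较为舒适。\t\t\tcn\t1\t1\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Number of sentences contributed by each source.
sentences_per_source = Counter(row["source"] for row in rows)

# Sentences with no recorded clips so far.
unrecorded = [row["sentence_id"] for row in rows if int(row["clips_count"]) == 0]
```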

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
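The `status` column makes it straightforward to reproduce the pending/rejected breakdown reported in the Text corpus section. A minimal sketch on invented rows, with a hypothetical `net_votes` mapping added to show how the vote columns might be combined:

```python
import csv
import io
from collections import Counter

# Made-up rows; only the column names and status values come from the list above.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\t示例句子一。\t\t\twiki\t1\t0\tpending\n"
    "u2\t示例句子二。\t\t\twiki\t0\t2\trejected\n"
    "u3\t示例句子三。\t\t\tcn\t0\t3\trejected\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE))

# Tally sentences by review status.
status_counts = Counter(row["status"] for row in rows)

# Net votes per sentence (upvotes minus downvotes).
net_votes = {row["sentence_id"]: int(row["up_votes"]) - int(row["down_votes"]) for row in rows}
```

On the full file, the pending and rejected tallies should add up to the number of unvalidated sentences (22 + 887 = 909 in this release).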

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
