Common Voice Scripted Speech 26.0 - Cantonese

粵語 — Cantonese (`yue`)

This datasheet is for cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Cantonese [粵語 - yue]. The dataset contains 279,366 clips representing 307.41 hours of recorded speech (210.7 hours validated) from 1,183 speakers, recorded from a text corpus of 28,727 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		126,764 (45.4%)	145 (12.3%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	47,656 (17.1%)	114 (9.6%)
female_feminine	Female, feminine	174,436 (62.4%)	132 (11.2%)
transgender	Transgender	-	-
non-binary	Non-binary	1,014 (0.4%)	2 (0.2%)
do_not_wish_to_say	Prefer not to say	59 (0.0%)	1 (0.1%)
-	Unspecified	56,201 (20.1%)	1,008 (85.2%)

Gender declared: 223,165 of 279,366 clips (79.9%), 175 of 1,183 speakers (14.8%)
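
The coverage line above can be reproduced from the table rows. The following is a minimal sketch; the helper name `coverage` is an assumption introduced here for illustration and is not part of any Common Voice tooling. The speaker figure is taken directly from the summary line rather than summed from the rows.

```python
def coverage(declared: int, total: int) -> str:
    """Format declared/total as counts plus a one-decimal percentage."""
    return f"{declared:,} of {total:,} ({declared / total:.1%})"

# Clip counts copied from the gender table above.
declared_clips = 47_656 + 174_436 + 1_014 + 59   # every row except Unspecified
total_clips = declared_clips + 56_201            # add the Unspecified row

print(coverage(declared_clips, total_clips))     # clips with a declared gender
print(coverage(175, 1_183))                      # speakers, from the summary line
```

The same helper applies unchanged to the age table.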

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	1,999 (0.7%)	20 (1.7%)
twenties	Twenties	81,658 (29.2%)	118 (10.0%)
thirties	Thirties	131,250 (47.0%)	93 (7.9%)
fourties	Forties	10,231 (3.7%)	40 (3.4%)
fifties	Fifties	222 (0.1%)	5 (0.4%)
sixties	Sixties	600 (0.2%)	4 (0.3%)
seventies	Seventies	-	-
eighties	Eighties	-	-
nineties	Nineties	1,004 (0.4%)	1 (0.1%)
-	Unspecified	52,402 (18.8%)	989 (83.6%)

Age declared: 226,964 of 279,366 clips (81.2%), 194 of 1,183 speakers (16.4%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	191,477 (68.5%)
Invalidated	8,096 (2.9%)
Other	79,793 (28.6%)

Training splits

Split	Clips
Train	7,420 (3.9%)
Dev	5,130 (2.7%)
Test	5,130 (2.7%)

Training split coverage: 17,680 of 191,477 validated clips (9.2%)

The dataset contains 191,477 validated, 8,096 invalidated, and 79,793 unresolved (Other) clips. The average clip duration is 3.961 seconds.
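
As a rough consistency check, the average clip duration follows from the total hours and clip count stated in this datasheet. The sketch below also estimates validated hours, under the assumption (not stated in the source) that validated clips have the same average duration as the dataset as a whole.

```python
# Figures copied from this datasheet.
total_hours = 307.41
total_clips = 279_366
validated_clips = 191_477

# Average duration in seconds: total recorded time divided by clip count.
avg_seconds = total_hours * 3600 / total_clips

# Assumption: validated clips share the overall average duration.
validated_hours = validated_clips * avg_seconds / 3600

print(f"average clip duration: {avg_seconds:.3f} s")
print(f"estimated validated hours: {validated_hours:.1f} h")
```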

Text corpus

Validated sentences: 18,556

Category	Count
Unvalidated sentences	10,171
Pending sentences	10,157
Rejected sentences	14
Reported sentences	2,257

The corpus contains 28,727 sentences: 18,556 validated and 10,171 unvalidated (10,157 pending review, 14 rejected), with 2,257 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

睇定個大概先
但係佢話教咗好耐學生都未熟五十音
如果中文係漢字，噉日本越南喃字都係中文囉
不過，有得打字
噉你都啱

Sources

Source	Sentences
sentence-collector	9,444 (50.9%)
Personal chat history	6,001 (32.3%)
Group chat history	2,804 (15.1%)
Other	307 (1.7%)

Text domains

Code	Domain	Clips	Speakers
general	General	1,682 (0.6%)	83 (7.0%)
agriculture_food	Agriculture and Food	18 (0.0%)	18 (1.5%)
automotive_transport	Automotive and Transport	52 (0.0%)	30 (2.5%)
finance	Finance	128 (0.0%)	32 (2.7%)
service_retail	Service and Retail	613 (0.2%)	52 (4.4%)
healthcare	Healthcare	417 (0.1%)	48 (4.1%)
history_law_government	History, Law and Government	67 (0.0%)	28 (2.4%)
media_entertainment	Media and Entertainment	192 (0.1%)	36 (3.0%)
nature_environment	Nature and Environment	188 (0.1%)	39 (3.3%)
news_current_affairs	News and Current Affairs	114 (0.0%)	31 (2.6%)
technology_robotics	Technology and Robotics	95 (0.0%)	30 (2.5%)
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
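
A minimal sketch of reading rows with these columns, using only the Python standard library. The two sample rows, the `client_id` values, the file names, and the function name `read_clips` are made up for illustration. Disabling CSV quoting is an assumption that sentences may contain literal quotation marks.

```python
import csv
import io

# Two made-up rows using the column names listed above (values are illustrative).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tsentence_id\tsentence_domain\tup_votes\tdown_votes\t"
    "age\tgender\taccents\tvariant\tlocale\tsegment\n"
    "abc123\tclip_0001.mp3\t噉你都啱\ts1\t\t2\t0\ttwenties\tfemale_feminine\t\t\tyue\t\n"
    "def456\tclip_0002.mp3\t不過，有得打字\ts2\tgeneral\t1\t2\t\t\t\t\tyue\t\n"
)

def read_clips(handle):
    """Yield one dict per clip row, with vote counts converted to int."""
    # QUOTE_NONE stops quotation marks inside sentences being parsed as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row

clips = list(read_clips(io.StringIO(SAMPLE_TSV)))

# Demographic fields are empty strings when the speaker did not opt in.
declared_age = [c["path"] for c in clips if c["age"]]
print(declared_age)
```

For a real file, the `io.StringIO` object would be replaced by an open file handle with `encoding="utf-8"`.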

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
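
The per-source counts in the Sources table could be recomputed from this file along the following lines. This is a sketch over three made-up rows. The `sentence_id` values and the `is_used` encoding (`1`/`0`) are assumptions, since the source does not specify them.

```python
import csv
import io
from collections import Counter

# Made-up rows with the columns listed above; ids and is_used values are assumed.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\t睇定個大概先\t\t\tsentence-collector\t1\t3\n"
    "s2\t噉你都啱\t\t\tPersonal chat history\t1\t0\n"
    "s3\t不過，有得打字\t\tgeneral\tsentence-collector\t0\t5\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Number of validated sentences per collection source.
sentences_by_source = Counter(row["source"] for row in rows)

# Sentences that have not been recorded yet.
unrecorded = [row["sentence_id"] for row in rows if int(row["clips_count"]) == 0]

print(sentences_by_source.most_common())
print(unrecorded)
```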

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
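
The pending and rejected totals in the Text corpus section correspond to a tally of the `status` column. A minimal sketch over made-up rows follows. The sentence text, ids, and vote values are placeholders, and only the two status values named above are assumed to occur.

```python
import csv
import io
from collections import Counter

# Placeholder rows with the columns listed above (all values are illustrative).
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tup_votes\tdown_votes\tstatus\n"
    "u1\texample sentence A\t\t\tOther\t1\t0\tpending\n"
    "u2\texample sentence B\t\t\tOther\t0\t2\trejected\n"
    "u3\texample sentence C\t\t\tsentence-collector\t0\t0\tpending\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

status_counts = Counter(row["status"] for row in rows)

# Pending and rejected should together account for every unvalidated sentence.
assert status_counts["pending"] + status_counts["rejected"] == len(rows)

print(dict(status_counts))
```

On the real file, the same check would mirror the datasheet's figures: 10,157 pending plus 14 rejected gives 10,171 unvalidated sentences.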

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
