Common Voice Scripted Speech 25.0 - Japanese

日本語 — Japanese (`ja`)

This datasheet describes release cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Japanese (日本語, `ja`). The dataset contains 584,432 clips representing 725.11 hours of recorded speech (371.92 hours validated) from 7,813 speakers, recorded from a text corpus of 51,207 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		53,100 (9.1%)	431 (5.5%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	156,678 (26.8%)	1,744 (22.3%)
female_feminine	Female, feminine	228,052 (39.0%)	2,261 (28.9%)
transgender	Transgender	203 (0.0%)	2 (0.0%)
non-binary	Non-binary	143 (0.0%)	4 (0.1%)
do_not_wish_to_say	Prefer not to say	5,158 (0.9%)	73 (0.9%)
-	Unspecified	194,068 (33.2%)	4,935 (63.2%)

Gender declared: 390,364 of 584,432 clips (66.8%), 2,878 of 7,813 speakers (36.8%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	47,749 (8.2%)	636 (8.1%)
twenties	Twenties	318,967 (54.6%)	4,004 (51.2%)
thirties	Thirties	40,270 (6.9%)	282 (3.6%)
fourties	Forties	48,396 (8.3%)	265 (3.4%)
fifties	Fifties	25,213 (4.3%)	215 (2.8%)
sixties	Sixties	5,435 (0.9%)	53 (0.7%)
seventies	Seventies	578 (0.1%)	7 (0.1%)
eighties	Eighties	-	-
nineties	Nineties	110 (0.0%)	2 (0.0%)
-	Unspecified	97,714 (16.7%)	3,932 (50.3%)

Age declared: 486,718 of 584,432 clips (83.3%), 3,881 of 7,813 speakers (49.7%)
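
The two coverage summaries appear to follow from subtracting each table's Unspecified row from the dataset totals. The sketch below reproduces them under that assumption; the function name `coverage` is illustrative and not part of any Common Voice tooling. The per-category speaker counts in the tables add up to more than the declared-speaker totals, so the sketch deliberately does not sum category rows.

```python
# Totals and Unspecified rows copied from the tables above.
TOTAL_CLIPS = 584_432
TOTAL_SPEAKERS = 7_813

UNSPECIFIED = {
    "gender": {"clips": 194_068, "speakers": 4_935},
    "age": {"clips": 97_714, "speakers": 3_932},
}


def coverage(total, unspecified):
    """Return (declared count, declared percentage rounded to one decimal).

    Assumption: "declared" means every item not in the Unspecified row.
    """
    declared = total - unspecified
    return declared, round(100 * declared / total, 1)


for field, row in UNSPECIFIED.items():
    clips, clip_pct = coverage(TOTAL_CLIPS, row["clips"])
    speakers, speaker_pct = coverage(TOTAL_SPEAKERS, row["speakers"])
    print(f"{field}: {clips:,} clips ({clip_pct}%), "
          f"{speakers:,} speakers ({speaker_pct}%)")
```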

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	299,767 (51.3%)
Invalidated	55,101 (9.4%)
Other	229,564 (39.3%)

Training splits

Split	Clips
Train	19,695 (6.6%)
Dev	9,019 (3.0%)
Test	9,019 (3.0%)

Training split coverage: 37,733 of 299,767 validated clips (12.6%)

The dataset contains 299,767 validated, 55,101 invalidated, and 229,564 unresolved (Other) clips. The average clip duration is 4.467 seconds.
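
As a quick consistency check, the figures in this section can be recomputed from the totals. The sketch below assumes the average clip duration is total recorded hours divided by total clips, which the datasheet does not state explicitly; the variable names are illustrative.

```python
# Figures copied from this datasheet.
TOTAL_CLIPS = 584_432
TOTAL_HOURS = 725.11
BUCKETS = {"validated": 299_767, "invalidated": 55_101, "other": 229_564}
SPLITS = {"train": 19_695, "dev": 9_019, "test": 9_019}

# The three clip buckets should account for every clip.
buckets_total = sum(BUCKETS.values())

# Share of validated clips that ended up in train, dev or test.
split_total = sum(SPLITS.values())
split_coverage = 100 * split_total / BUCKETS["validated"]

# Assumed definition: total duration in seconds over number of clips.
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_CLIPS

print(f"buckets sum to total: {buckets_total == TOTAL_CLIPS}")
print(f"split clips: {split_total:,} ({split_coverage:.1f}% of validated)")
print(f"average clip duration: {avg_seconds:.3f} s")
```

Under these assumptions the recomputed values agree with the reported split coverage and average duration to the precision shown in the datasheet.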

Text corpus

Validated sentences: 43,942

Category	Count
Unvalidated sentences	7,265
Pending sentences	4,718
Rejected sentences	2,547
Reported sentences	791

The corpus contains 51,207 sentences: 43,942 validated and 7,265 unvalidated (4,718 pending review, 2,547 rejected), with 791 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

今の自民党政権に戻るとの意味では全くない。中道改革勢力を結集し、私たちが政権を担える政治を目指すということだ
金融機関に連絡する義務がある。
本人が謝るのが筋だろうに
そして、これが空いている時間です。
I want to eat an apple!

Sources

Source	Sentences
sentence-collector	11,437 (26.1%)
covost2-en_ja	6,713 (15.3%)
自己引用	3,171 (7.2%)
Self Citation	2,798 (6.4%)
JSUT	2,176 (5.0%)
Aozora Bunko	1,994 (4.5%)
yumie-text-1	1,579 (3.6%)
https://slib.net/50151	1,229 (2.8%)
?????	786 (1.8%)
Other	12,010 (27.4%)

Text domains

Code	Domain	Clips	Speakers
general	General	24,569 (4.2%)	2,845 (36.4%)
agriculture_food	Agriculture and Food	9,173 (1.6%)	2,319 (29.7%)
automotive_transport	Automotive and Transport	2,152 (0.4%)	1,323 (16.9%)
finance	Finance	2,214 (0.4%)	1,249 (16.0%)
service_retail	Service and Retail	490 (0.1%)	417 (5.3%)
healthcare	Healthcare	2,647 (0.5%)	1,512 (19.4%)
history_law_government	History, Law and Government	4,744 (0.8%)	1,630 (20.9%)
media_entertainment	Media and Entertainment	5,068 (0.9%)	1,789 (22.9%)
nature_environment	Nature and Environment	2,761 (0.5%)	1,530 (19.6%)
news_current_affairs	News and Current Affairs	1,068 (0.2%)	825 (10.6%)
technology_robotics	Technology and Robotics	674 (0.1%)	411 (5.3%)
language_fundamentals	Language Fundamentals	3,849 (0.7%)	1,657 (21.2%)

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
text - expected transcription of the audio
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
segment - if sentence belongs to a custom dataset segment, it will be listed here
prompt_upvotes - number of upvotes the sentence prompt received
prompt_reports - number of reports the sentence prompt received
is_edited - whether the clip's transcription has been edited
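
A minimal sketch of reading clip rows with Python's standard `csv` module follows. The column order, the sample rows (IDs, paths, vote counts) and the helper name `read_clips` are assumptions made for illustration; only the field names come from the list above. The vote comparison is a simple filter, not the rule Common Voice uses to assign clips to buckets.

```python
import csv
import io

# Column names as listed above; the column ORDER is an assumption.
CLIP_FIELDS = [
    "client_id", "path", "text", "up_votes", "down_votes", "age", "gender",
    "accents", "variant", "segment", "prompt_upvotes", "prompt_reports",
    "is_edited",
]

# Hypothetical rows for illustration only (made-up IDs, paths and votes).
SAMPLE_ROWS = [
    ["abc123", "clip_0001.mp3", "本人が謝るのが筋だろうに", "2", "0",
     "twenties", "female_feminine", "", "", "", "0", "0", ""],
    ["def456", "clip_0002.mp3", "金融機関に連絡する義務がある。", "0", "2",
     "", "", "", "", "", "0", "0", ""],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [CLIP_FIELDS] + SAMPLE_ROWS) + "\n"


def read_clips(handle):
    """Parse clip rows from a TSV file object, converting vote counts to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips = []
    for row in reader:
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        clips.append(row)
    return clips


clips = read_clips(io.StringIO(SAMPLE_TSV))

# Clips with more up-votes than down-votes.
net_positive = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
# Clips whose speaker did not declare an age (empty field).
no_age = [c["path"] for c in clips if not c["age"]]

print(net_positive)
print(no_age)
```

For a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.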

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
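
The fields above lend themselves to simple per-source summaries. The sketch below counts sentences and recorded clips per source; the sample rows are hypothetical placeholders, the column order is assumed, and `is_used` is left unparsed because its encoding is not described here.

```python
import csv
import io
from collections import Counter

# Column names as listed above; the column ORDER is an assumption.
FIELDS = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "is_used", "clips_count"]

# Hypothetical placeholder rows, not taken from the real corpus.
SAMPLE_ROWS = [
    ["s1", "(sentence text 1)", "", "general", "sentence-collector", "1", "3"],
    ["s2", "(sentence text 2)", "", "", "JSUT", "1", "0"],
    ["s3", "(sentence text 3)", "", "finance", "sentence-collector", "0", "12"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [FIELDS] + SAMPLE_ROWS) + "\n"

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t",
                        quoting=csv.QUOTE_NONE)

sentences_per_source = Counter()
clips_per_source = Counter()
unrecorded = []  # sentence_ids with no clips yet

for row in reader:
    count = int(row["clips_count"] or 0)
    sentences_per_source[row["source"]] += 1
    clips_per_source[row["source"]] += count
    if count == 0:
        unrecorded.append(row["sentence_id"])

for source, n in sentences_per_source.most_common():
    print(f"{source}: {n} sentences, {clips_per_source[source]} clips")
print("no clips yet:", unrecorded)
```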

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
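
Since `status` takes the values pending or rejected, unvalidated sentences can be grouped by it. The following sketch does so on hypothetical placeholder rows; the column order and the sample values are assumptions, not data from the release.

```python
import csv
import io
from collections import defaultdict

# Column names as listed above; the column ORDER is an assumption.
FIELDS = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "up_votes", "down_votes", "status"]

# Hypothetical placeholder rows, not taken from the real corpus.
SAMPLE_ROWS = [
    ["u1", "(sentence text 1)", "", "", "(source A)", "1", "0", "pending"],
    ["u2", "(sentence text 2)", "", "general", "(source B)", "0", "2", "rejected"],
    ["u3", "(sentence text 3)", "", "", "(source A)", "0", "0", "pending"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [FIELDS] + SAMPLE_ROWS) + "\n"

by_status = defaultdict(list)
reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
for row in reader:
    # Net votes: up-votes minus down-votes, treating empty fields as zero.
    row["net_votes"] = int(row["up_votes"] or 0) - int(row["down_votes"] or 0)
    by_status[row["status"]].append(row)

for status in sorted(by_status):
    ids = [r["sentence_id"] for r in by_status[status]]
    print(f"{status}: {len(ids)} sentences {ids}")
```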

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data, you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
