Common Voice Scripted Speech 26.0 - Interlingua

Interlingua — Interlingua (`ia`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Interlingua (`ia`). The dataset contains 14,770 clips representing 17.24 hours of recorded speech (14.27 hours validated) from 72 speakers, recorded from a text corpus of 9,214 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		15 (0.1%)	2 (2.8%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	9,033 (61.2%)	19 (26.4%)
female_feminine	Female, feminine	74 (0.5%)	3 (4.2%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	1 (0.0%)	1 (1.4%)
-	Unspecified	5,662 (38.3%)	54 (75.0%)

Gender declared: 9,108 of 14,770 clips (61.7%), 18 of 72 speakers (25.0%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	36 (0.2%)	2 (2.8%)
twenties	Twenties	702 (4.8%)	7 (9.7%)
thirties	Thirties	286 (1.9%)	6 (8.3%)
fourties	Forties	4,337 (29.4%)	2 (2.8%)
fifties	Fifties	420 (2.8%)	4 (5.6%)
sixties	Sixties	20 (0.1%)	1 (1.4%)
seventies	Seventies	3,307 (22.4%)	1 (1.4%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	5,662 (38.3%)	54 (75.0%)

Age declared: 9,108 of 14,770 clips (61.7%), 18 of 72 speakers (25.0%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	12,229 (82.8%)
Invalidated	362 (2.5%)
Other	2,179 (14.8%)

Training splits

Split	Clips
Train	4,886 (40.0%)
Dev	1,898 (15.5%)
Test	1,897 (15.5%)

Training split coverage: 8,681 of 12,229 validated clips (71.0%)

The dataset contains 12,229 validated, 362 invalidated, and 2,179 unresolved ("Other") clips. The average clip duration is 4.202 seconds.
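The hour totals and the split coverage reported above can be cross-checked from the clip counts. The following is a minimal sketch, assuming the average clip duration applies equally to every bucket (the datasheet does not state this); the constant and function names are illustrative and not part of the dataset.

```python
# Figures copied from the tables in this datasheet.
TOTAL_CLIPS = 14_770
VALIDATED_CLIPS = 12_229
AVG_CLIP_SECONDS = 4.202
SPLITS = {"train": 4_886, "dev": 1_898, "test": 1_897}


def hours(clips: int, avg_seconds: float = AVG_CLIP_SECONDS) -> float:
    """Estimate total duration in hours from a clip count and a mean clip length."""
    return clips * avg_seconds / 3600


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_hours = round(hours(TOTAL_CLIPS), 2)
validated_hours = round(hours(VALIDATED_CLIPS), 2)
split_total = sum(SPLITS.values())
split_coverage = percent(split_total, VALIDATED_CLIPS)

print(total_hours, validated_hours, split_total, split_coverage)
```

Under that assumption the estimates agree with the reported 17.24 total hours, 14.27 validated hours, and 71.0% split coverage.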

Text corpus

Validated sentences: 9,069

Category	Count
Unvalidated sentences	145
Pending sentences	145
Rejected sentences	-
Reported sentences	274

The corpus contains 9,214 sentences: 9,069 validated and 145 unvalidated (145 pending review, 0 rejected), with 274 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Ille repeteva lentemente su nomine.
Nos plantava un picea in le parco.
Nos la eligeva pro parlar a nostre inseniante sur le question.
Que nos postpone le viage inaugural.
Il habeva euphoria al quartiero general.

Sources

Source	Sentences
sentence-collector	9,020 (99.5%)
Other	49 (0.5%)

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
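The column layout above can be read with Python's standard `csv` module. This is a minimal sketch: the two rows, including the `client_id` values and file names, are hypothetical stand-ins for a real clips file, and the vote comparison is an illustrative filter rather than the project's validation rule.

```python
import csv
import io

# Hypothetical two-row excerpt in the column layout listed above; real files
# ship with the release and hold one row per clip.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tsentence_id\tsentence_domain\tup_votes\t"
    "down_votes\tage\tgender\taccents\tvariant\tlocale\tsegment\n"
    "abc123\tclip_0001.mp3\tIlle repeteva lentemente su nomine.\ts1\t\t2\t0\t"
    "twenties\tmale_masculine\t\t\tia\t\n"
    "def456\tclip_0002.mp3\tNos plantava un picea in le parco.\ts2\t\t0\t2\t"
    "\t\t\t\tia\t\n"
)


def read_clips(handle):
    """Yield one dict per clip row, converting the vote counts to integers."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row


clips = list(read_clips(io.StringIO(SAMPLE_TSV)))

# Empty demographic fields mean the speaker did not opt in to share them.
declared = [c for c in clips if c["age"] or c["gender"]]

# Illustrative filter only: clips with more up-votes than down-votes.
more_up_than_down = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]

print(len(clips), len(declared), more_up_than_down)
```

To read a real file, replace the `io.StringIO` object with a handle opened using `encoding="utf-8"` and `newline=""`.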

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
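The per-source counts in the Sources table can be tallied from this file. The sketch below assumes a three-row hypothetical excerpt; the sentence IDs, the `is_used` values, and the clip counts are invented for illustration, and the actual encoding of `is_used` is not specified in this datasheet.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt in the column layout listed above. The is_used values
# ("1"/"0") are an assumed encoding and are left unparsed here.
SAMPLE_SENTENCES = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tIlle repeteva lentemente su nomine.\t\t\tsentence-collector\t1\t3\n"
    "s2\tNos plantava un picea in le parco.\t\t\tsentence-collector\t1\t0\n"
    "s3\tQue nos postpone le viage inaugural.\t\t\tOther\t0\t1\n"
)

reader = csv.DictReader(
    io.StringIO(SAMPLE_SENTENCES), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

# Number of validated sentences contributed by each source.
sentences_per_source = Counter(row["source"] for row in rows)

# Number of recorded clips attributed to each source.
clips_per_source = Counter()
for row in rows:
    clips_per_source[row["source"]] += int(row["clips_count"])

# Sentences that have not been recorded yet.
unrecorded = [row["sentence_id"] for row in rows if int(row["clips_count"]) == 0]

print(sentences_per_source, clips_per_source, unrecorded)
```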

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
