Common Voice Scripted Speech 26.0 - Spanish

Common Voice Scripted Speech 26.0 - Spanish | Mozilla Data Collective

Español — Spanish (`es`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Spanish [Español - es]. The dataset contains 1,680,810 clips representing 2,278.66 hours of recorded speech (595.45 hours validated) from 26,940 speakers, recorded from a text corpus of 1,087,461 sentences.

Language

Accents

Code	Accent	Clips	Speakers
mexicano	México	879,480 (52.3%)	1,380 (5.1%)
surpeninsular	España: Sur peninsular (Andalucia, Extremadura, Murcia)	178,950 (10.6%)	261 (1.0%)
nortepeninsular	España: Norte peninsular (Asturias, Castilla y León, Cantabria, País Vasco, Navarra, Aragón, La Rioja, Guadalajara, Cuenca)	66,653 (4.0%)	638 (2.4%)
andino	Andino-Pacífico: Colombia, Perú, Ecuador, oeste de Bolivia y Venezuela andina	38,728 (2.3%)	947 (3.5%)
centrosurpeninsular	España: Centro-Sur peninsular (Madrid, Toledo, Castilla-La Mancha)	30,583 (1.8%)	511 (1.9%)
rioplatense	Rioplatense: Argentina, Uruguay, este de Bolivia, Paraguay	23,828 (1.4%)	579 (2.1%)
caribe	Caribe: Cuba, Venezuela, Puerto Rico, República Dominicana, Panamá, Colombia caribeña, México caribeño, Costa del golfo de México	21,835 (1.3%)	582 (2.2%)
canario	España: Islas Canarias	16,193 (1.0%)	240 (0.9%)
americacentral	América central	12,763 (0.8%)	377 (1.4%)
chileno	Chileno: Chile, Cuyo	12,569 (0.7%)	295 (1.1%)
filipinas	Español de Filipinas	606 (0.0%)	3 (0.0%)
-	Other	6,956 (0.4%)	228 (0.8%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	933,044 (55.5%)	4,510 (16.7%)
female_feminine	Female, feminine	524,992 (31.2%)	1,821 (6.8%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	15 (0.0%)	3 (0.0%)
-	Unspecified	222,759 (13.3%)	21,318 (79.1%)

Gender declared: 1,458,051 of 1,680,810 clips (86.7%), 5,622 of 26,940 speakers (20.9%)
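The coverage line above can be reproduced from the table by subtracting the Unspecified row from the dataset totals. The following is a minimal sketch of that arithmetic; the variable and function names are illustrative and not part of any Common Voice tooling.

```python
# Totals and the "Unspecified" row, copied from the gender table above.
TOTAL_CLIPS = 1_680_810
TOTAL_SPEAKERS = 26_940
UNSPECIFIED_CLIPS = 222_759
UNSPECIFIED_SPEAKERS = 21_318


def coverage(total, unspecified):
    """Return (declared count, declared share in percent)."""
    declared = total - unspecified
    return declared, 100 * declared / total


declared_clips, clip_pct = coverage(TOTAL_CLIPS, UNSPECIFIED_CLIPS)
declared_speakers, speaker_pct = coverage(TOTAL_SPEAKERS, UNSPECIFIED_SPEAKERS)

print(f"Clips:    {declared_clips:,} declared ({clip_pct:.1f}%)")
print(f"Speakers: {declared_speakers:,} declared ({speaker_pct:.1f}%)")
```

The per-category clip counts add up to the declared clip total exactly, but the per-category speaker counts add up to 6,334, more than the 5,622 declared speakers. The datasheet does not explain the difference; one possible reading is that some speakers appear under more than one category, so the speaker coverage is safer to derive from the Unspecified row than from summing the rows.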

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	131,769 (7.8%)	736 (2.7%)
twenties	Twenties	883,287 (52.6%)	2,927 (10.9%)
thirties	Thirties	155,762 (9.3%)	1,272 (4.7%)
fourties	Forties	46,508 (2.8%)	951 (3.5%)
fifties	Fifties	70,883 (4.2%)	535 (2.0%)
sixties	Sixties	176,849 (10.5%)	174 (0.6%)
seventies	Seventies	791 (0.0%)	28 (0.1%)
eighties	Eighties	251 (0.0%)	4 (0.0%)
nineties	Nineties	128 (0.0%)	5 (0.0%)
-	Unspecified	214,582 (12.8%)	21,078 (78.2%)

Age declared: 1,466,228 of 1,680,810 clips (87.2%), 5,862 of 26,940 speakers (21.8%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	439,225 (26.1%)
Invalidated	95,240 (5.7%)
Other	1,146,345 (68.2%)

Training splits

Split	Clips
Train	359,550 (81.9%)
Dev	15,904 (3.6%)
Test	15,904 (3.6%)

Training split coverage: 391,358 of 439,225 validated clips (89.1%)

The dataset contains 439,225 validated, 95,240 invalidated, and 1,146,345 unresolved clips (the Other bucket above). The average clip duration is 4.881 seconds.
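As a rough cross-check, the hour totals quoted at the top of the datasheet can be estimated from the clip counts and the average clip duration. The sketch below assumes the 4.881-second average applies equally to every bucket, which the datasheet does not state, so the results are approximations only.

```python
# Clip counts per bucket and the average duration, copied from this section.
AVG_CLIP_SECONDS = 4.881
CLIP_BUCKETS = {
    "validated": 439_225,
    "invalidated": 95_240,
    "other": 1_146_345,
}


def estimated_hours(clips, avg_seconds=AVG_CLIP_SECONDS):
    """Estimate recorded hours as clip count times average duration."""
    return clips * avg_seconds / 3600


total_clips = sum(CLIP_BUCKETS.values())
total_hours = estimated_hours(total_clips)
validated_hours = estimated_hours(CLIP_BUCKETS["validated"])

print(f"{total_clips:,} clips, roughly {total_hours:.0f} hours in total")
print(f"roughly {validated_hours:.0f} validated hours")
```

Both estimates land within an hour of the reported 2,278.66 total hours and 595.45 validated hours; the small gap is consistent with rounding in the published average.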

Text corpus

Validated sentences: 1,082,540

Category	Count
Unvalidated sentences	4,921
Pending sentences	3,926
Rejected sentences	995
Reported sentences	2,750

The corpus contains 1,087,461 sentences: 1,082,540 validated and 4,921 unvalidated (3,926 pending review, 995 rejected), with 2,750 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Detalles del lavado quirúrgico.
Jugó de delantero y fue internacional en una ocasión con la selección de Italia.
Además, el edificio es sede de un hostal.
Es un hatchback de cinco puertas con motor delantero y tracción delantera.
Se dice que se llevó a los alemanes del Palacio Real a la picota.

Sources

Source	Sentences
wiki	1,062,101 (98.4%)
sentence-collector	14,245 (1.3%)
Other	3,245 (0.3%)

Text domains

Code	Domain	Clips	Speakers
general	General	49 (0.0%)	28 (0.1%)
agriculture_food	Agriculture and Food	1 (0.0%)	1 (0.0%)
automotive_transport	Automotive and Transport	4 (0.0%)	3 (0.0%)
finance	Finance	6 (0.0%)	4 (0.0%)
service_retail	Service and Retail	3 (0.0%)	3 (0.0%)
healthcare	Healthcare	4 (0.0%)	2 (0.0%)
history_law_government	History, Law and Government	39 (0.0%)	25 (0.1%)
media_entertainment	Media and Entertainment	11 (0.0%)	6 (0.0%)
nature_environment	Nature and Environment	12 (0.0%)	10 (0.0%)
news_current_affairs	News and Current Affairs	19 (0.0%)	13 (0.0%)
technology_robotics	Technology and Robotics	22 (0.0%)	11 (0.0%)
language_fundamentals	Language Fundamentals	8 (0.0%)	5 (0.0%)

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
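The field list above can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the sample rows (ids, paths, votes) are made up for illustration, and it assumes that an undeclared demographic field is stored as an empty string. Quoting is disabled so that quotation marks inside sentences are kept literally.

```python
import csv
import io

FIELDS = [
    "client_id", "path", "sentence", "sentence_id", "sentence_domain",
    "up_votes", "down_votes", "age", "gender", "accents", "variant",
    "locale", "segment",
]

# Made-up rows for illustration only; none of these values come from the dataset.
SAMPLE_ROWS = [
    {"client_id": "aaa111", "path": "clip_0001.mp3", "sentence_id": "s1",
     "sentence": "Detalles del lavado quirúrgico.",
     "up_votes": "2", "down_votes": "0", "age": "twenties", "locale": "es"},
    {"client_id": "bbb222", "path": "clip_0002.mp3", "sentence_id": "s2",
     "sentence": "Además, el edificio es sede de un hostal.",
     "up_votes": "0", "down_votes": "2", "locale": "es"},
    {"client_id": "aaa111", "path": "clip_0003.mp3", "sentence_id": "s3",
     "sentence": "Frase de ejemplo inventada.",
     "up_votes": "3", "down_votes": "1", "age": "twenties", "locale": "es"},
]


def to_tsv(rows):
    """Serialise the sample rows as tab-separated text with a header line."""
    lines = ["\t".join(FIELDS)]
    lines += ["\t".join(row.get(name, "") for name in FIELDS) for row in rows]
    return "\n".join(lines) + "\n"


def read_clips(handle):
    """Yield one dict per clip, with the vote columns converted to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        yield row


clips = list(read_clips(io.StringIO(to_tsv(SAMPLE_ROWS))))

# Clips where more listeners agreed than disagreed with the recording.
agreed = [clip["path"] for clip in clips if clip["up_votes"] > clip["down_votes"]]
# Clips whose speaker did not declare an age.
undeclared_age = [clip["path"] for clip in clips if not clip["age"]]

print(agreed)
print(undeclared_age)
```

With a real file, `io.StringIO(...)` would be replaced by `open(path, encoding="utf-8", newline="")`. The vote comparison here is only an example filter and is not the rule used to build the Validated bucket.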

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
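One use of these fields is to total recordings per source and to spot sentences that have not been recorded yet. The sketch below uses made-up rows; the `is_used` value of `1` is an assumption about how the flag is encoded, and the code does not depend on it.

```python
import csv
import io
from collections import defaultdict

HEADER = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "is_used", "clips_count"]

# Made-up rows for illustration only; counts and ids are not from the dataset.
ROWS = [
    ["s1", "Además, el edificio es sede de un hostal.", "", "", "wiki", "1", "3"],
    ["s2", "Detalles del lavado quirúrgico.", "", "", "wiki", "1", "0"],
    ["s3", "Frase de ejemplo inventada.", "", "general", "sentence-collector", "1", "1"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def clips_per_source(handle):
    """Return (clip totals keyed by source, ids of sentences with no clips)."""
    totals = defaultdict(int)
    unrecorded = []
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        count = int(row["clips_count"])
        totals[row["source"]] += count
        if count == 0:
            unrecorded.append(row["sentence_id"])
    return dict(totals), unrecorded


totals, unrecorded = clips_per_source(io.StringIO(SAMPLE_TSV))
print(totals)
print(unrecorded)
```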

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
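The `status` column makes it straightforward to tally pending and rejected sentences. The following is a minimal sketch over made-up rows, assuming the column holds the literal strings `pending` and `rejected` as described above.

```python
import csv
import io
from collections import Counter

HEADER = ["sentence_id", "sentence", "variant", "sentence_domain",
          "source", "up_votes", "down_votes", "status"]

# Made-up rows for illustration only; none of these values come from the dataset.
ROWS = [
    ["u1", "Primera frase inventada.", "", "", "sentence-collector", "1", "0", "pending"],
    ["u2", "Segunda frase inventada.", "", "", "sentence-collector", "0", "2", "rejected"],
    ["u3", "Tercera frase inventada.", "", "", "sentence-collector", "0", "0", "pending"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def status_counts(handle):
    """Count unvalidated sentences by their review status."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)


counts = status_counts(io.StringIO(SAMPLE_TSV))
print(counts["pending"], counts["rejected"])
```

Run against the real file, the tally would be expected to match the text corpus table: 3,926 pending and 995 rejected, for 4,921 unvalidated sentences in total.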

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
