Common Voice Scripted Speech 26.0 - Swedish

Svenska — Swedish (`sv-SE`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Swedish [Svenska - sv-SE]. The dataset contains 50,259 clips representing 56.18 hours of recorded speech (47.73 hours validated) from 889 speakers, recorded from a text corpus of 33,472 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		4,132 (8.2%)	57 (6.4%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	23,521 (46.8%)	218 (24.5%)
female_feminine	Female, feminine	16,044 (31.9%)	34 (3.8%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	10,694 (21.3%)	714 (80.3%)

Gender declared: 39,565 of 50,259 clips (78.7%), 175 of 889 speakers (19.7%)
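The coverage summary can be reproduced from the table by subtracting the Unspecified row from the totals. The sketch below is a minimal illustration using the figures above; the helper name `coverage` is hypothetical. Note that the per-category speaker counts (218 + 34) add up to more than 175, which suggests that a speaker may be counted in more than one row; that reading is an assumption, not something the datasheet states.

```python
# Figures copied from the gender table above.
total_clips = 50_259
total_speakers = 889
unspecified_clips = 10_694
unspecified_speakers = 714


def coverage(total: int, unspecified: int) -> tuple[int, float]:
    """Return (declared count, declared share in percent, one decimal)."""
    declared = total - unspecified
    return declared, round(100 * declared / total, 1)


clip_cov = coverage(total_clips, unspecified_clips)
speaker_cov = coverage(total_speakers, unspecified_speakers)
print(f"Gender declared: {clip_cov[0]:,} clips ({clip_cov[1]}%)")
print(f"Gender declared: {speaker_cov[0]:,} speakers ({speaker_cov[1]}%)")
```

The same subtraction applied to the age table gives the age coverage line.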

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	1,548 (3.1%)	17 (1.9%)
twenties	Twenties	5,953 (11.8%)	99 (11.1%)
thirties	Thirties	12,745 (25.4%)	86 (9.7%)
fourties	Forties	18,706 (37.2%)	46 (5.2%)
fifties	Fifties	1,634 (3.3%)	19 (2.1%)
sixties	Sixties	50 (0.1%)	5 (0.6%)
seventies	Seventies	35 (0.1%)	1 (0.1%)
eighties	Eighties	-	-
nineties	Nineties	20 (0.0%)	1 (0.1%)
-	Unspecified	9,568 (19.0%)	699 (78.6%)

Age declared: 40,691 of 50,259 clips (81.0%), 190 of 889 speakers (21.4%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	42,707 (85.0%)
Invalidated	1,601 (3.2%)
Other	5,951 (11.8%)

Training splits

Split	Clips
Train	8,272 (19.4%)
Dev	5,514 (12.9%)
Test	5,516 (12.9%)

Training split coverage: 19,302 of 42,707 validated clips (45.2%)

The dataset contains 42,707 validated, 1,601 invalidated, and 5,951 unresolved (Other) clips. The average clip duration is 4.024 seconds.
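As a rough cross-check, the split coverage and average clip duration can be recomputed from the published figures. This is only a sketch: the total of 56.18 hours is itself rounded, so the recomputed average is approximate, and the variable names are illustrative.

```python
# Figures copied from the clip bucket and training split tables above.
validated, invalidated, other = 42_707, 1_601, 5_951
train, dev, test = 8_272, 5_514, 5_516
total_hours = 56.18  # rounded value from the datasheet summary

total_clips = validated + invalidated + other
split_clips = train + dev + test

# Share of validated clips that ended up in train, dev, or test.
split_coverage = round(100 * split_clips / validated, 1)

# Average duration in seconds, from total hours and total clip count.
avg_seconds = round(total_hours * 3600 / total_clips, 3)

print(total_clips, split_clips, split_coverage, avg_seconds)
```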

Text corpus

Validated sentences: 26,878

Category	Count
Unvalidated sentences	6,594
Pending sentences	6,447
Rejected sentences	147
Reported sentences	598

The corpus contains 33,472 sentences: 26,878 validated and 6,594 unvalidated (6,447 pending review, 147 rejected), with 598 reported for review.

Sample

There follows a randomly selected sample of five sentences from the corpus.

Uppgifterna kan då normalt tillhandahållas utan någon föregående bearbetning av dem.
Stödet för lila trafikljus minskar.
Han låg kvar i sängen och lyssnade till regnet utanför.
Lägg det på hans säng.
Tills vi ses igen.

Sources

Source	Sentences
sentence-collector	24,303 (90.4%)
Project Gutenberg, with slight tweaks from me.	1,313 (4.9%)
covost2-en_sv-SE	1,205 (4.5%)
Other	57 (0.2%)

Text domains

Code	Domain	Clips	Speakers
general	General	4 (0.0%)	3 (0.3%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	-	-
finance	Finance	-	-
service_retail	Service and Retail	-	-
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	-	-
media_entertainment	Media and Entertainment	1 (0.0%)	1 (0.1%)
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	-	-
technology_robotics	Technology and Robotics	-	-
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said audio matches the text
down_votes - number of people who said audio does not match text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
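A minimal sketch of reading such a file with Python's standard `csv` module follows. The three sample rows are invented for illustration (they are not taken from the dataset), and the function name `read_clips` is hypothetical. Quoting is disabled on the assumption that sentences may contain literal quote characters.

```python
import csv
import io
from collections import Counter

COLUMNS = ["client_id", "path", "sentence", "sentence_id", "sentence_domain",
           "up_votes", "down_votes", "age", "gender", "accents", "variant",
           "locale", "segment"]

# Invented rows for illustration only; empty strings stand for undeclared fields.
SAMPLE_ROWS = [
    ["aaa", "clip_1.mp3", "Tills vi ses igen.", "s1", "", "2", "0",
     "thirties", "female_feminine", "", "", "sv-SE", ""],
    ["aaa", "clip_2.mp3", "Lägg det på hans säng.", "s2", "", "2", "1",
     "thirties", "female_feminine", "", "", "sv-SE", ""],
    ["bbb", "clip_3.mp3", "Stödet för lila trafikljus minskar.", "s3", "", "0", "2",
     "", "", "", "", "sv-SE", ""],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [COLUMNS, *SAMPLE_ROWS]) + "\n"


def read_clips(handle):
    """Parse a clips TSV into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_clips(io.StringIO(SAMPLE_TSV))

# Clips per declared gender; an empty field is treated as unspecified.
gender_counts = Counter(row["gender"] or "unspecified" for row in rows)

# Distinct speakers are identified by their hashed client_id.
speakers = {row["client_id"] for row in rows}

print(gender_counts, len(speakers))
```

To read a real file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"` and `newline=""`.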

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
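A per-source breakdown like the Sources table could be derived from this file roughly as follows. This is a sketch over invented rows; the function name `source_shares` is hypothetical, and the value `1` used for `is_used` is an assumption about how that flag is encoded.

```python
import csv
import io
from collections import Counter

COLUMNS = ["sentence_id", "sentence", "variant", "sentence_domain",
           "source", "is_used", "clips_count"]

# Invented rows for illustration only.
SAMPLE_ROWS = [
    ["v1", "Exempelmening ett.", "", "", "sentence-collector", "1", "2"],
    ["v2", "Exempelmening två.", "", "", "sentence-collector", "1", "0"],
    ["v3", "Exempelmening tre.", "", "", "sentence-collector", "1", "1"],
    ["v4", "Exempelmening fyra.", "", "", "covost2-en_sv-SE", "1", "3"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [COLUMNS, *SAMPLE_ROWS]) + "\n"


def source_shares(handle):
    """Return {source: (sentence count, percent of all sentences)} and total clips."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    total_clips = 0
    for row in reader:
        counts[row["source"]] += 1
        total_clips += int(row["clips_count"] or 0)
    total = sum(counts.values())
    shares = {src: (n, round(100 * n / total, 1)) for src, n in counts.items()}
    return shares, total_clips


shares, total_clips_recorded = source_shares(io.StringIO(SAMPLE_TSV))
print(shares, total_clips_recorded)
```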

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
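The pending and rejected counts reported in the Text corpus section correspond to a tally of the `status` column. A minimal sketch, again over invented rows and with a hypothetical function name `tally_status`:

```python
import csv
import io
from collections import Counter

COLUMNS = ["sentence_id", "sentence", "variant", "sentence_domain",
           "source", "up_votes", "down_votes", "status"]

# Invented rows for illustration only.
SAMPLE_ROWS = [
    ["u1", "Exempelmening ett.", "", "", "sentence-collector", "1", "0", "pending"],
    ["u2", "Exempelmening två.", "", "", "sentence-collector", "0", "2", "rejected"],
    ["u3", "Exempelmening tre.", "", "", "sentence-collector", "0", "0", "pending"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [COLUMNS, *SAMPLE_ROWS]) + "\n"


def tally_status(handle):
    """Count unvalidated sentences by their status value."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["status"] for row in reader)


status_counts = tally_status(io.StringIO(SAMPLE_TSV))
print(status_counts)
```

Run against the real file, the two counts would be expected to sum to the number of unvalidated sentences (6,447 + 147 = 6,594 in this release).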

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
