Common Voice Scripted Speech 26.0 - Sorbian, Upper

Hornjoserbšćina — Sorbian, Upper (`hsb`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for Sorbian, Upper [Hornjoserbšćina - hsb]. The dataset contains 5,952 clips representing 12.86 hours of recorded speech (3.59 hours validated) from 34 speakers, recorded from a text corpus of 7,722 sentences.

Language

Accents

Code	Accent	Clips	Speakers
-		10 (0.2%)	1 (2.9%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	1,495 (25.1%)	12 (35.3%)
female_feminine	Female, feminine	1,009 (17.0%)	4 (11.8%)
transgender	Transgender	-	-
non-binary	Non-binary	-	-
do_not_wish_to_say	Prefer not to say	-	-
-	Unspecified	3,448 (57.9%)	20 (58.8%)

Gender declared: 2,504 of 5,952 clips (42.1%), 14 of 34 speakers (41.2%)

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	1,039 (17.5%)	3 (8.8%)
twenties	Twenties	201 (3.4%)	2 (5.9%)
thirties	Thirties	175 (2.9%)	1 (2.9%)
fourties	Forties	953 (16.0%)	7 (20.6%)
fifties	Fifties	126 (2.1%)	2 (5.9%)
sixties	Sixties	40 (0.7%)	2 (5.9%)
seventies	Seventies	55 (0.9%)	2 (5.9%)
eighties	Eighties	-	-
nineties	Nineties	-	-
-	Unspecified	3,363 (56.5%)	18 (52.9%)

Age declared: 2,589 of 5,952 clips (43.5%), 16 of 34 speakers (47.1%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	1,663 (27.9%)
Invalidated	252 (4.2%)
Other	4,037 (67.8%)

Training splits

Split	Clips
Train	812 (48.8%)
Dev	350 (21.0%)
Test	501 (30.1%)

Training split coverage: 1,663 of 1,663 validated clips (100.0%)
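The split percentages in the table can be reproduced from the clip counts. The following is a minimal Python sketch, assuming the counts are copied by hand from the table above; the variable names are illustrative and not part of any Common Voice tooling.

```python
# Clip counts copied by hand from the "Training splits" table.
splits = {"train": 812, "dev": 350, "test": 501}

# All validated clips are assigned to exactly one split (100.0% coverage).
validated_total = sum(splits.values())

# Share of each split among validated clips, rounded to one decimal place.
shares = {
    name: round(100 * count / validated_total, 1)
    for name, count in splits.items()
}

for name, share in shares.items():
    print(f"{name}: {share}%")
```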

The dataset contains 1,663 validated, 252 invalidated, and 4,037 unresolved (Other) clips. The average clip duration is 7.783 seconds.
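As a rough consistency check, the reported hours can be approximated from the clip counts and the average clip duration. This Python sketch assumes that validated clips have the same average duration as the dataset as a whole, which the datasheet does not state. Because the average is itself rounded, the results land within about 0.01 hours of the reported 12.86 and 3.59 hours rather than matching exactly.

```python
total_clips = 5952
validated_clips = 1663
avg_seconds = 7.783  # average clip duration reported in the datasheet

# Total recorded speech in hours.
total_hours = total_clips * avg_seconds / 3600

# Assumption: validated clips share the overall average duration.
validated_hours = validated_clips * avg_seconds / 3600

print(f"{total_hours:.2f} h total, {validated_hours:.2f} h validated")
```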

Text corpus

Validated sentences: 7,192

Category	Count
Unvalidated sentences	530
Pending sentences	504
Rejected sentences	26
Reported sentences	128

The corpus contains 7,722 sentences: 7,192 validated and 530 unvalidated (504 pending review, 26 rejected), with 128 reported for review.

Sample

The following five sentences were randomly selected from the corpus.

Tohorunja žada so statny spěchowanski program za strukturnu změnu we Łužicy.
Najebać to pěstuja wobydlerjo w kulturnym a sportowym towarstwje starodawne nałožki.
Za požadanja rewizijow maja so wuměnjenja za dohlad do podłožkow tworić.
Tutu składnosć wužiwachu mjez druhim Asturjenjo, Baskojo, Frijawlojo, Wuchodni Frizojo a Serbja.
Chcu tebje ekskomunikować, dokelž njejsy prawidła cyrkwje dodźeržał.

Sources

Source	Sentences
sentence-collector	7,171 (99.7%)
Other	21 (0.3%)

Text domains

Code	Domain	Clips	Speakers
general	General	-	-
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	-	-
finance	Finance	-	-
service_retail	Service and Retail	-	-
healthcare	Healthcare	-	-
history_law_government	History, Law and Government	-	-
media_entertainment	Media and Entertainment	-	-
nature_environment	Nature and Environment	-	-
news_current_affairs	News and Current Affairs	2 (0.0%)	1 (2.9%)
technology_robotics	Technology and Robotics	-	-
language_fundamentals	Language Fundamentals	-	-

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of people who said the audio matches the text
down_votes - number of people who said the audio does not match the text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - the custom dataset segment the sentence belongs to, if any
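The fields above can be read with Python's standard csv module. The sketch below parses a small in-memory sample rather than a real file: the rows, the client_id values, and the clip file names are invented for illustration, and only a subset of the columns is shown. It also assumes that an undeclared demographic value appears as an empty field, and it disables quote handling on the assumption that sentences may contain literal quotation marks.

```python
import csv
import io
from collections import Counter

# Hypothetical sample with a subset of the columns; real files carry all fields.
sample_tsv = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tgender\n"
    "aaa\tclip_1.mp3\tExample one.\t2\t0\tfemale_feminine\n"
    "bbb\tclip_2.mp3\tExample two.\t0\t2\t\n"
    "aaa\tclip_3.mp3\tExample three.\t2\t1\tfemale_feminine\n"
)

# QUOTE_NONE keeps quotation marks inside sentences as plain characters.
reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

# Assumption: an empty gender field means the speaker did not declare one.
clips_per_gender = Counter(row["gender"] or "Unspecified" for row in rows)

# Distinct speakers are identified by the hashed client_id.
speakers = {row["client_id"] for row in rows}

# Vote columns are read as strings and must be converted before arithmetic.
net_votes = {
    row["path"]: int(row["up_votes"]) - int(row["down_votes"]) for row in rows
}

print(clips_per_gender, len(speakers), net_votes)
```

To run this against a real release, replace the io.StringIO object with a file opened with encoding="utf-8" and newline="".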

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)
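The pending and rejected counts in the text corpus summary correspond to the status column. As a hedged illustration, the following Python sketch tallies status values from an in-memory sample; the sentence identifiers and texts are invented, and only some of the columns are included.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; a real file has one row per unvalidated sentence.
sample_tsv = (
    "sentence_id\tsentence\tup_votes\tdown_votes\tstatus\n"
    "s1\tFirst example.\t1\t0\tpending\n"
    "s2\tSecond example.\t0\t2\trejected\n"
    "s3\tThird example.\t0\t0\tpending\n"
)

reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)

# Tally sentences by their review status (pending or rejected).
status_counts = Counter(row["status"] for row in reader)

print(status_counts["pending"], status_counts["rejected"])
```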

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. This information is reported only if the speaker opted in to provide it.
