Common Voice Spontaneous Speech 4.0 - Scots

Common Voice Spontaneous Speech 4.0 - Scots | Mozilla Data Collective

sco — Scots (`sco`)

This datasheet is for version sps-corpus-4.0-2026-06-12 of the Mozilla Common Voice Spontaneous Speech dataset for Scots [sco - sco]. The dataset contains 715 clips representing 11.18 hours of recorded speech (10.69 hours validated) from 21 speakers.

Language

Scots, a sister language of English spoken throughout Scotland, has a long history. It arose from northern English dialects around the 14th century and spread east and northwards, supplanting the indigenous Gaelic language and developing into a socially and politically high-status language with spoken and written norms distinct from those in England. From the 17th century onwards, political, religious and social events led to a loss of status, and thus a shrinking of the domains in which Scots was used. While English norms replaced Scots in writing, spoken Scots continued to be used. Present-day Scots is characterised by the Scots Linguistic Continuum, with Scottish Standard English – generally described as being close to Standard English but with an overlay of distinctly Scottish sounds – at one end and Broad Scots – much further from Standard English, with its own words, sounds and sentence structures – at the other. In terms of social profile, Scottish Standard English is spoken by middle-class speakers and in more formal situations such as schools, while Broad Scots is spoken by working-class speakers and in informal situations such as with family and friends. Speakers may style-shift up and down the continuum according to, amongst other factors, interlocutor and context. At the Broad Scots end of the continuum there is significant geographic diversity: for example, speakers in Glasgow sound very different from speakers in Aberdeen. The speakers in these recordings are from a range of geographic locations and align more with the Broad Scots end of the continuum.

Data splits for modelling

The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.

Audio clips

Bucket	Clips	%
Transcribed & Validated	680	95.1%
Transcribed & Pending	0	0.0%
Not transcribed	35	4.9%

Training splits

Bucket	Clips	%
Train	452	63.2%
Dev	142	19.9%
Test	86	12.0%
Unassigned	35	4.9%

Training split coverage: 680 of 680 transcribed & validated clips (100.0%)
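The percentages in the tables above can be re-derived from the clip counts. The following is a minimal sketch of that arithmetic; the counts are copied from the tables, while the helper name `percentages` and the variable names are illustrative assumptions, not part of any Common Voice tooling.

```python
# Clip counts copied from the tables above.
TOTAL_CLIPS = 715
split_counts = {"Train": 452, "Dev": 142, "Test": 86, "Unassigned": 35}


def percentages(counts, total):
    """Return each bucket's share of `total` in percent, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


split_pct = percentages(split_counts, TOTAL_CLIPS)

# Clips assigned to a training split (everything except "Unassigned").
assigned = sum(n for name, n in split_counts.items() if name != "Unassigned")

print(split_pct)
print(assigned)
```

The assigned total equals the number of transcribed and validated clips, which is why the coverage line reports 100.0%.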

Transcriptions

The transcription system uses general Latin script.

Prompts: 47
Duration: 40234608 ms
Avg. Transcription Len: 725
Avg. Duration: 56.27 s
Valid Duration: 38478.56 s
Total hours: 11.18 h
Valid hours: 10.69 h
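The hour and average-duration figures follow from the raw durations by unit conversion. A short sketch of that conversion, using the values listed above (the variable names are illustrative, and rounding to two decimals is an assumption about how the figures were produced):

```python
MS_PER_HOUR = 3_600_000
S_PER_HOUR = 3_600

total_ms = 40_234_608   # "Duration" above, in milliseconds
valid_s = 38_478.56     # "Valid Duration" above, in seconds
clips = 715             # total clip count from the dataset summary

total_hours = round(total_ms / MS_PER_HOUR, 2)
valid_hours = round(valid_s / S_PER_HOUR, 2)
avg_clip_s = round(total_ms / 1000 / clips, 2)

print(total_hours, valid_hours, avg_clip_s)
```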

Transcription status

Bucket	Clips	%
Validated	680	100.0%
Pending	0	0.0%
Edited	190	27.9%

Writing system

Present-day Scots has no written standard, and orthographic conventions vary both within and between the dialects represented. For example, “can’t” may be written “cannae” or “canny” in Edinburgh, but “canna” in Aberdeen. For these transcriptions, we have followed protocols documented in previous research, e.g. https://scotssyntaxatlas.ac.uk

Samples

Questions

There follows a randomly selected sample of questions used in the corpus.

What’s the most stressful part of your work?
How do you try and save money?
What was your favourite subject in school and why?
What kind of art do you like?
Who won in a fight with your siblings?

Responses

There follows a randomly selected sample of transcribed responses from the corpus.

Where I live? I live in a block of flats and a’ the neighbours are really nosy, you know? It’s a’, eh, compete against each other with a lot of them. You know, who’s [inc] eh, better than the Jones’ and a’ that stuff- keeping up with the Jones and having a, know “my house has got this, my house has got that”, you know? And then you’re a’ this and that, you know? And I don’t go for a’ that so it really does my nut in- does. Cannae go it. Can’t stand that. And one of the good things aboot staying here is sh- the shops is just- got a big shop just across the road. Fi- two minutes and you’re in it and you can get anything you want which is very handy- very, very handy. And, um, the oth- the other good positive thing aboot it is too- is for transport. You’ve got trains and you’ve got any bus take you anywhere. It’s quite good. It is very good, eh [inc] that way. I suppose a negative thing is, eh- eh basically the neighbours' nosy and I just keep myself to myself, say aye and naw- keep it right, you know? But eh, negative things? Aye, I would say another negative thing is some of the- the weans go by and they’re shouting up and shouting and a’ that and dead noisy and you can maybe want a wee sleep and that at certain times [inc]. It’s awfu’ annoying, you know? Awfu’ annoying and it wakes you up. It’s no good. Disturbs you.
My opinion on AI I think is it’s genius but it’s also quite scary because it’s like computers- if computers shut doon and you’ve lost all your data, you know, this is a kind of a robotic- I- I’m trying to find how to put it into words with m-m-ake- explaining it, how I feel about it. I just feel it’s genius but there’s also- there’s- there’s a concern aboot it as well that is- it doesnae sit right with me.
Worst technological advancement, or piece of technology I would say is these daft smartwatches or smertwatches [laugh] if you’re being funny. Smartwatches that, eh, tell you how you’ve slept, cause what a load of keech, that is a load of keech, and folk buy them and think its dead important to ken how you sleep and ken if you’re getting a good rest and a’ that well, what- what do folk do with that information after that, ken what I mean, it tells you didn’t sleep well last night, and then that reprogrammes folk’s minds and they think right well I’ve no slept well so I’ll be tired the day, rather than just seeing how they actually feel for the day. So smartwatches, load of keech
I don’t really know what my favourite film is but- like thinking about it, intellectually and carefully but like eh, probably the film that I’m stirred to say most immediately- to pick out most immediately, the one I’ve watched- rewatched the most is Aliens. The nineteen-eighties the second of the Alien movies, just because it’s the most ridiculously endlessly thrilling action film of that era. It’s so perfectly realised and Sigourney Weaver is just this incredible badass that just keeps getting better throughout the film and just keeps- just culminates in these incredible final scenes, and even kind of the final scene of the final scene, the kind of, when the alien kind of fights wi- when Ripley fights the alien at the end, it’s just- no matter how old I am, no matter how many times I’ve seen that, the moment she kind of emerges and goes “get away from her you bitch” It’ll never never in my life not induce the kind of thrill that it induced in me like twenty-something years ago when I first watched it. I cannot wait to watch it with my son- the older son because he’s just going to love it so much, but you know it’s a pretty gory film so there’s a couple of ripe old moments, so it’s not something I’m going to hurry him into watching anytime soon.
I am so proud of both my daughters, I feel they’re both grafters, and they are so true to theirself, they speak their minds, and they work hard for what they’ve got, but at the same time they are very good people, they give out a lot of love, and receive a lot of love, and you can see that amongst their friends, and like myself they have had friends fae as far back as like nursery and school, and still see them, and I find that’s very very important.

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker (see Footnotes)
gender - gender of the speaker (see Footnotes)
language - language name
split - for data modelling, the subset of the data this clip belongs to
char_per_sec - how many characters of transcription per second of audio
quality_tags - some automated assessment of the transcription--audio pair, separated by |
- transcription-length - fewer than 3 characters of transcription per second of audio
- speech-rate - more than 30 characters of transcription per second of audio
- short-audio - audio length under 2 seconds
- long-audio - audio length over 5 minutes
- non-allowed-script - transcription contains characters from a writing system not associated with the language
- mixed-script-words - a single word contains characters from multiple writing systems
- mixed-script-transcription - transcription spans multiple writing systems, but each word consistently uses only one
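To make the duration and rate thresholds above concrete, here is a minimal sketch of how those four tags could be derived from `transcription` and `duration_ms`. It is not the pipeline's actual implementation: the function names are hypothetical, the script-based tags are omitted, and counting every character of the transcription (including spaces and punctuation) is an assumption.

```python
def char_per_sec(transcription: str, duration_ms: int) -> float:
    """Characters of transcription per second of audio (assumes all characters count)."""
    if duration_ms <= 0:
        return 0.0
    return len(transcription) / (duration_ms / 1000)


def duration_rate_tags(transcription: str, duration_ms: int) -> str:
    """Return the duration- and rate-based quality tags, separated by '|'."""
    tags = []
    rate = char_per_sec(transcription, duration_ms)
    if rate < 3:                      # under 3 characters per second
        tags.append("transcription-length")
    if rate > 30:                     # over 30 characters per second
        tags.append("speech-rate")
    if duration_ms < 2_000:           # audio under 2 seconds
        tags.append("short-audio")
    if duration_ms > 5 * 60 * 1000:   # audio over 5 minutes
        tags.append("long-audio")
    return "|".join(tags)


# A made-up 1.5-second clip with a three-character transcription.
print(duration_rate_tags("aye", 1_500))
```

A clip with typical values for this corpus (about 725 characters over about 56 seconds, roughly 13 characters per second) would receive none of these four tags.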

Get involved

Community links

Discussions

Contribute

Acknowledgements

Datasheet authors

Jennifer Smith <jennifer.smith@glasgow.ac.uk>

Funding

This dataset was partially funded by the Open Multilingual Speech Fund managed by Mozilla Common Voice.

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.

Footnotes

For a full list of age, gender, and accent options, see the demographics spec. These are reported only if the speaker opted in to provide that information.
