Common Voice Spontaneous Speech 4.0 - Irish

Gaeilge — Irish (`ga-IE`)

This datasheet describes the sps-corpus-4.0-2026-06-12 release of the Mozilla Common Voice Spontaneous Speech dataset for Irish [Gaeilge - ga-IE]. The dataset contains 36 clips representing 0.15 hours of recorded speech (0.01 hours validated) from 4 speakers.

Data splits for modelling

The dataset clips are categorised by transcription status and training-set assignment. The following tables summarise the distribution.

Audio clips

Bucket	Clips	%
Transcribed & Validated	4	11.1%
Transcribed & Pending	5	13.9%
Not transcribed	27	75.0%

Training splits

Bucket	Clips	%
Train	0	0.0%
Dev	0	0.0%
Test	0	0.0%
Unassigned	36	100.0%

Training split coverage: 0 of 4 transcribed & validated clips (0.0%)
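
The percentages in the tables above follow directly from the clip counts. As a minimal sketch of that arithmetic (the helper name `share` is hypothetical, and reporting an empty denominator as 0.0% is an assumption rather than documented behaviour):

```python
def share(count: int, total: int) -> str:
    """Format count/total as a percentage with one decimal place."""
    if total == 0:
        # Assumption: an empty bucket set is reported as 0.0%.
        return "0.0%"
    return f"{100 * count / total:.1f}%"


# Clip counts copied from the "Audio clips" table.
audio_buckets = {
    "Transcribed & Validated": 4,
    "Transcribed & Pending": 5,
    "Not transcribed": 27,
}
total_clips = sum(audio_buckets.values())

for bucket, clips in audio_buckets.items():
    print(bucket, clips, share(clips, total_clips), sep="\t")

# Coverage: clips assigned to train/dev/test out of transcribed & validated clips.
assigned = 0 + 0 + 0
print("Training split coverage:", share(assigned, audio_buckets["Transcribed & Validated"]))
```

The same helper applied to the 9 transcribed clips gives the shares in the transcription status table.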

Transcriptions

Transcription status

Bucket	Clips	%
Validated	4	44.4%
Pending	5	55.6%
Edited	1	11.1%

Samples

Questions

The following is a randomly selected sample of questions used in the corpus.

An bhfuil sé drochbhéasach "dún do chlab" a rá le duine?
Conas a mholfá do dhaoine Gaeilge a fhoghlaim?
An dtuigeann tú an difear idir go háirithe agus ach go háirithe?
An ciotóg nó an deasóg thú?
Cén saghas ceol beo a sheinntear i do bhaile?

Responses

The following is a randomly selected sample of transcribed responses from the corpus.

Is breá liom an tsraith Modern Family. Tá sé an-chliste agus an-greannmhar.
Téim chuig an lúthlann agus déanaim dreasanna aclaíochta. Is breá liom na meáchain a iompair chun mo mheatáin a neartú agus a mhéadú. Agus freisin, ithim bia atá folláin. Ithim glasraí, torthaí, sailéid. Ólaim uisce agus tá sé sin thar a bheith tábhachtach.
D'fhreastail mé ar chóisir agus bhí spraoi agam i mBaile Átha Cliath.
Sílim go bhfuil buirlí tuí ciorclacha níos fearr mar is féidir iad a bhogadh go héasca agus is féidir spraoi a bheith ag daoine orthu. Is féidir leo léimt in airde orthu agus tosaigh ag siúl agus tosaíonn na buirlí tuí sin ag bogadh.
Níl mórán suim agam i Lá Fhéile Pádraig, ach bíonn sé deas dos na leanaí.

Fields

Each row of a tsv file represents a single audio clip, and contains the following information:

client_id - hashed UUID of a given user
audio_id - numeric id for audio file
audio_file - audio file name
duration_ms - duration of audio in milliseconds
prompt_id - numeric id for prompt
prompt - question for user
transcription - transcription of the audio response
votes - number of people who approved a given transcript
age - age of the speaker (see Footnotes)
gender - gender of the speaker (see Footnotes)
language - language name
split - which subset of the data this clip belongs to for modelling
char_per_sec - how many characters of transcription per second of audio
quality_tags - automated quality assessments of the transcription-audio pair, separated by |
- transcription-length - fewer than 3 characters of transcription per second of audio
- speech-rate - more than 30 characters of transcription per second of audio
- short-audio - audio length under 2 seconds
- long-audio - audio length over 5 minutes
- non-allowed-script - transcription contains characters from a writing system not associated with the language
- mixed-script-words - a single word contains characters from multiple writing systems
- mixed-script-transcription - transcription spans multiple writing systems, but each word consistently uses only one
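
To illustrate how the length and rate tags relate to the `transcription` and `duration_ms` fields, here is a minimal sketch using the thresholds listed above. It is not the pipeline that produced the dataset: the function names are hypothetical, and it assumes that every character of the transcription (including spaces) is counted and that the thresholds are strict inequalities. The script-based tags are not covered.

```python
def char_per_sec(transcription: str, duration_ms: int) -> float:
    """Characters of transcription per second of audio."""
    if duration_ms <= 0:
        return 0.0  # assumption: treat missing or zero duration as 0
    return len(transcription) / (duration_ms / 1000)


def length_and_rate_tags(transcription: str, duration_ms: int) -> str:
    """Build a |-separated tag string from the documented thresholds."""
    tags = []
    cps = char_per_sec(transcription, duration_ms)
    if cps < 3:
        tags.append("transcription-length")
    if cps > 30:
        tags.append("speech-rate")
    if duration_ms < 2_000:  # under 2 seconds
        tags.append("short-audio")
    if duration_ms > 5 * 60 * 1_000:  # over 5 minutes
        tags.append("long-audio")
    return "|".join(tags)


def parse_quality_tags(value: str) -> list[str]:
    """Split a quality_tags cell into individual tags; empty cell -> []."""
    return [tag for tag in value.split("|") if tag]
```

For example, `parse_quality_tags(length_and_rate_tags("abc", 1500))` yields `["transcription-length", "short-audio"]`, since 3 characters over 1.5 seconds is 2 characters per second.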

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
