VoxForge - Catalan | Mozilla Data Collective

Description

39 minutes of Catalan read speech, collected as part of the VoxForge project.

Specifics

Licensing

GNU General Public License v3.0 or later (GPL-3.0-or-later)

https://spdx.org/licenses/GPL-3.0-or-later.html

Considerations

Restrictions/Special Constraints

N/A

Forbidden Usage

N/A

Processes

Intended Use

ASR training and evaluation

Metadata

The voice data was contributed by volunteers who read prompts aloud. For Catalan, there is just over 1 hour of recorded speech.

The following is a breakdown of the number of utterances per speaker (note that "anonymous" most likely covers multiple speakers):

Speaker	Count
anonymous	128
duhow	80
Guillem	60
RainCT	60
Pere	30
hseara	20
rain	20
Kyngo	10
RogerR	10
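
As a quick check on the breakdown above, the per-speaker counts can be totalled and turned into shares. This is a minimal sketch: the numbers are copied from the table, and the variable names are illustrative rather than part of the dataset.

```python
# Utterance counts copied from the speaker table above.
utterances = {
    "anonymous": 128,
    "duhow": 80,
    "Guillem": 60,
    "RainCT": 60,
    "Pere": 30,
    "hseara": 20,
    "rain": 20,
    "Kyngo": 10,
    "RogerR": 10,
}

total = sum(utterances.values())
print(f"total utterances: {total}")  # total utterances: 418

# Share of the dataset contributed under each speaker name.
for speaker, count in utterances.items():
    print(f"{speaker}: {count} ({count / total:.1%})")
```

Since "anonymous" accounts for roughly 30% of the utterances and may cover several people, a speaker-disjoint train/test split based on these names alone would only be approximate.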

Dataset format

The top-level directory contains subdirectories, one for each recorded speaker/session. Each of these subdirectories is structured as follows:

├── wav/
│   ├── file1.wav
│   ├── file2.wav
│   └── ...
└── etc/
    ├── GPL_license.txt
    ├── PROMPTS
    ├── prompts-original
    └── README

where each line of PROMPTS and prompts-original contains an audio id, followed by a space and the prompt text (the transcript).
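
The layout above suggests a simple way to pair each recording with its transcript. The following is a minimal sketch, not an official loader. The function names are hypothetical, and it assumes that the files are readable as UTF-8 and that the last path component of each audio id matches a file name in wav/ without the .wav extension. Neither assumption is stated in the dataset description, so check them against the actual files.

```python
from pathlib import Path


def parse_prompts(text: str) -> dict[str, str]:
    """Map each audio id to its transcript, reading one "<id> <prompt text>" entry per line."""
    prompts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split once only: the transcript itself contains spaces.
        parts = line.split(maxsplit=1)
        audio_id = parts[0]
        transcript = parts[1] if len(parts) > 1 else ""
        prompts[audio_id] = transcript
    return prompts


def load_session(session_dir: Path) -> list[tuple[Path, str]]:
    """Return (wav path, transcript) pairs for one speaker/session subdirectory."""
    prompts_file = session_dir / "etc" / "PROMPTS"
    # Assumption: UTF-8 text; undecodable bytes are replaced rather than raising.
    prompts = parse_prompts(prompts_file.read_text(encoding="utf-8", errors="replace"))
    pairs = []
    for audio_id, transcript in prompts.items():
        # Assumption: the id's last path component names the file in wav/.
        wav_path = session_dir / "wav" / (Path(audio_id).name + ".wav")
        if wav_path.exists():
            pairs.append((wav_path, transcript))
    return pairs
```

Calling load_session on every subdirectory of the top-level directory would then yield the full list of audio/transcript pairs. Ids with no matching audio file are silently skipped here, so comparing the number of pairs against the number of prompts is a useful sanity check.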

See https://www.voxforge.org/home/about for more details about the project and dataset.