CoVoST 2 English-Slovenian

Description

CoVoST 2 is a large-scale multilingual speech-to-text translation corpus based on Mozilla Common Voice 4.0. This segment of the corpus contains the English audio and its Slovenian translations.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Research and non-commercial use only

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset. You agree not to train models for public distribution on this dataset. Any attempt to clone the voices of, or train models that imitate, the speakers in this dataset is forbidden.

Processes

Ethical Review

Description

End-to-end speech-to-text translation (ST) has recently attracted increased interest because of its system simplicity, lower inference latency, and fewer compounding errors compared with cascaded ST (i.e., speech recognition followed by machine translation). End-to-end ST model training, however, is often hampered by the lack of parallel data. Thus, we created CoVoST, a large-scale multilingual ST corpus based on Common Voice, to foster ST research with the largest ever open dataset. Its latest version covers translations from English into 15 languages (Arabic, Catalan, Welsh, German, Estonian, Persian, Indonesian, Japanese, Latvian, Mongolian, Slovenian, Swedish, Tamil, Turkish, and Chinese) and from 21 languages into English, including the 15 target languages as well as Spanish, French, Italian, Dutch, Portuguese, and Russian. It has a total of 2,880 hours of speech and is diversified with 78K speakers.

Fields

path: Filename of the audio file
sentence: The sentence in the source language
translation: The sentence in the target language
client_id: The ID of the speaker of the source language, used for maintaining hygiene in the splits.
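
One plausible use of client_id for split hygiene is checking that no speaker appears in more than one split. The following is a minimal sketch under that assumption, not the corpus authors' procedure; the function name find_speaker_leaks and the split names are hypothetical, and rows are assumed to be dictionaries keyed by the field names above.

```python
from collections import defaultdict


def find_speaker_leaks(splits):
    """Return the client_ids that occur in more than one split.

    `splits` maps a split name (hypothetical, e.g. "train") to an iterable
    of row dicts that each carry a "client_id" key.
    """
    seen = defaultdict(set)
    for split_name, rows in splits.items():
        for row in rows:
            # Record every split in which this speaker occurs.
            seen[row["client_id"]].add(split_name)
    # Keep only speakers that were seen in two or more splits.
    return {cid: names for cid, names in seen.items() if len(names) > 1}
```

An empty result means the splits are speaker-disjoint; any returned client_id identifies a speaker shared between the listed splits.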

Example

path	sentence	translation	client_id
common_voice_en_18540003.mp3	When water is scarce, avoid wasting it.	Varčuj z vodo, ko je primanjkuje.	d277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658
common_voice_en_18540005.mp3	You will drive with her to her door.	Z njo se boš peljal do njenih vrat.	d277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658
common_voice_en_18540006.mp3	Celia shrank back, shivering.	Celia je skočila nazaj in drhtela.	d277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658
common_voice_en_65557.mp3	Have you got a ring?	Imaš prstan?	d28566a5d710dbd7e6c2ab4686ad5bd22ec86588a3abd11cefe0e93182e39a6f9da80550916fadb13e5ef051c7819e5aa5fc2e0ebbddc1b847c14926106c3fe3
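
The example above is tab-separated with a header row, so it can be read with Python's standard csv module. This is a sketch under assumptions: the function name read_covost_tsv is hypothetical, quoting is disabled on the assumption that quotation marks in sentences are literal text, and the client_id in the sample is a shortened placeholder rather than a real speaker ID.

```python
import csv
import io

# Column names as listed in the Fields section.
FIELDS = ["path", "sentence", "translation", "client_id"]


def read_covost_tsv(handle):
    """Read tab-separated rows from a text handle into a list of dicts."""
    # QUOTE_NONE treats quotation marks inside sentences as plain characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [name for name in FIELDS if name not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Placeholder client_id; real IDs are long hexadecimal strings.
sample = (
    "path\tsentence\ttranslation\tclient_id\n"
    "common_voice_en_65557.mp3\tHave you got a ring?\tImaš prstan?\tplaceholder-id\n"
)
rows = read_covost_tsv(io.StringIO(sample))
print(rows[0]["sentence"], "->", rows[0]["translation"])
```

For a file on disk, the same function can be given a handle opened with encoding="utf-8" and newline="" so that the Slovenian characters and line endings are preserved.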

Citation

If you use this dataset in your work, please cite:

@misc{wang2020covost,
    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
    author={Changhan Wang and Anne Wu and Juan Pino},
    year={2020},
    eprint={2007.10310},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
