CoVoST 2 Chinese (China) - English | Mozilla Data Collective

Description

CoVoST 2 is a large-scale multilingual speech-to-text translation corpus based on Mozilla Common Voice 4.0. This segment of the corpus contains 24 hours of Chinese (China) audio and the corresponding English translations.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

Research and non-commercial use only.

Forbidden Usage

You agree not to attempt to determine the identity of speakers in this dataset. You agree not to train models for public distribution on this dataset. Any attempt to clone the voice or train models that imitate the speakers in this dataset is forbidden.

Processes

Description

This dataset contains 16829 audio clips totalling 24:16:13 of audio in Chinese (China) with the corresponding translations in English.

Background

End-to-end speech-to-text translation (ST) has recently attracted increased interest given its system simplicity, lower inference latency, and fewer compounding errors compared to cascaded ST (i.e. speech recognition + machine translation). End-to-end ST model training, however, is often hampered by the lack of parallel data. Thus, we created CoVoST, a large-scale multilingual ST corpus based on Common Voice, to foster ST research with the largest-ever open dataset. Its latest version covers translations from English into 15 languages (Arabic, Catalan, Welsh, German, Estonian, Persian, Indonesian, Japanese, Latvian, Mongolian, Slovenian, Swedish, Tamil, Turkish, Chinese) and from 21 languages into English, including the 15 target languages as well as Spanish, French, Italian, Dutch, Portuguese, and Russian. It has a total of 2,880 hours of speech and is diversified across 78K speakers.

Fields

path: Filename of the audio file
sentence: The sentence in the source language
translation: The sentence in the target language
client_id: The ID of the speaker of the source language, used for maintaining hygiene in the splits.

Example

path	sentence	translation	client_id
common_voice_zh-CN_18536372.mp3	对于更高阶的导数，我们可以继续同样的过程。	For derivatives of higher order, we can use the same process.	c69453a9ae8cdd8ce47e767209e28e254719861de178ac57dfcc4018e0786a51541dda25d2468764d21722b2613ec628063aa15c64aa13dd35a5a5862fadcf04
common_voice_zh-CN_18536373.mp3	乳头凹陷也称为乳头内陷，是指乳头凹陷，未突出乳房的情形。	An inverted nipple is a condition where the nipple, instead of pointing outward, is retracted into the breast.	c69453a9ae8cdd8ce47e767209e28e254719861de178ac57dfcc4018e0786a51541dda25d2468764d21722b2613ec628063aa15c64aa13dd35a5a5862fadcf04
common_voice_zh-CN_18536375.mp3	在很多情况下借词中的被读作而非，甚至在源语言中读作时也如此。	It is often that the loan words are read as exceptions, including time in source language.	c69453a9ae8cdd8ce47e767209e28e254719861de178ac57dfcc4018e0786a51541dda25d2468764d21722b2613ec628063aa15c64aa13dd35a5a5862fadcf04
common_voice_zh-CN_18536377.mp3	身体被安全带拉住	The body is pulled by the seat belt.	c69453a9ae8cdd8ce47e767209e28e254719861de178ac57dfcc4018e0786a51541dda25d2468764d21722b2613ec628063aa15c64aa13dd35a5a5862fadcf04
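The tab-separated layout above can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader: the helper name `read_rows` and the in-memory sample are assumptions made for illustration, and the sample contains only two rows copied from the example above. Quoting is disabled on the assumption that sentences may contain literal quote characters.

```python
import csv
import io

# Speaker ID copied from the example rows above.
SPEAKER = (
    "c69453a9ae8cdd8ce47e767209e28e254719861de178ac57dfcc4018e0786a5154"
    "1dda25d2468764d21722b2613ec628063aa15c64aa13dd35a5a5862fadcf04"
)

# Two example rows, rebuilt as tab-separated text for a self-contained demo.
SAMPLE_TSV = (
    "path\tsentence\ttranslation\tclient_id\n"
    "common_voice_zh-CN_18536372.mp3\t对于更高阶的导数，我们可以继续同样的过程。\t"
    "For derivatives of higher order, we can use the same process.\t" + SPEAKER + "\n"
    "common_voice_zh-CN_18536377.mp3\t身体被安全带拉住\t"
    "The body is pulled by the seat belt.\t" + SPEAKER + "\n"
)


def read_rows(handle):
    """Return one dict per clip, keyed by the four documented fields.

    `read_rows` is a hypothetical helper, not part of any CoVoST tooling.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["path"], "->", row["translation"])
```

For a file on disk, the same helper would take a handle opened with `encoding="utf-8"` and `newline=""`; the actual file name depends on how the dataset is distributed.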

Splits

Split	# clips
Train	7086
Dev	4844
Test	4899
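As a quick sanity check, the split sizes can be compared against the totals stated in the dataset description (16829 clips, 24:16:13 of audio). The sketch below only does arithmetic on the figures given on this page; the variable names are illustrative.

```python
# Clip counts per split, as listed in the table above.
SPLIT_CLIPS = {"Train": 7086, "Dev": 4844, "Test": 4899}

total_clips = sum(SPLIT_CLIPS.values())
print("total clips:", total_clips)  # 16829, matching the description

# Fraction of all clips that falls into each split.
shares = {name: count / total_clips for name, count in SPLIT_CLIPS.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Total duration 24:16:13 (hours:minutes:seconds) converted to seconds.
hours, minutes, seconds = 24, 16, 13
total_seconds = hours * 3600 + minutes * 60 + seconds

# Mean clip length across the whole segment, in seconds.
mean_clip_seconds = total_seconds / total_clips
print(f"mean clip length: {mean_clip_seconds:.2f} s")
```

Note that the dev and test splits are each roughly two thirds the size of the training split, which is unusually large for held-out data.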

Citation

If you use this dataset in your work, please cite:

@misc{wang2020covost,
    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
    author={Changhan Wang and Anne Wu and Juan Pino},
    year={2020},
    eprint={2007.10310},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
