Datasets

LocaleNLP

English Hausa Parallel Corpus

An English–Hausa dataset with 5,000 sentence pairs useful for machine translation and basic language processing tasks.
License: CC-BY-NC-4.0

Locale: eng, hau

Task: MT

Format: csv

Size: 164.32 KB

Anjuman e Katib

Persian Literature Corpus by Najwai Sukhan

A curated Persian literary corpus of ~1.26M tokens spanning literature, poetry, educational writing, and culturally significant texts.
License: CC-BY-NC-4.0

Locale: fas

Task: NLP

Format: TXT

Size: 38.62 MB

Community

Heroes English-Spanish Dubbed Movie Speech Corpus

7,000 single-speaker speech segments from the original and Spanish-dubbed versions of 21 episodes of the TV series Heroes.
License: CC-BY-SA-4.0

Locale: eng, spa

Task: NLP

Format: wav, csv, txt

Size: 1.68 GB

Common Voice

Common Voice Scripted Speech 25.0 - Swahili

A collection of read speech recordings in Swahili.
License: CC0-1.0

Locale: sw

Task: ASR

Format: MP3

Size: 20.87 GB

Common Voice

Common Voice Scripted Speech 25.0 - Kabyle

A collection of read speech recordings in Kabyle.
License: CC0-1.0

Locale: kab

Task: ASR

Format: MP3

Size: 17.43 GB

Common Voice

Common Voice Scripted Speech 25.0 - Basque

A collection of read speech recordings in Basque.
License: CC0-1.0

Locale: eu

Task: ASR

Format: MP3

Size: 14.48 GB

Common Voice

Common Voice Scripted Speech 25.0 - Japanese

A collection of read speech recordings in Japanese.
License: CC0-1.0

Locale: ja

Task: ASR

Format: MP3

Size: 14.34 GB

Common Voice

Common Voice Scripted Speech 25.0 - Luganda

A collection of read speech recordings in Luganda.
License: CC0-1.0

Locale: lg

Task: ASR

Format: MP3

Size: 11.06 GB

Common Voice

Common Voice Scripted Speech 25.0 - Czech

A collection of read speech recordings in Czech.
License: CC0-1.0

Locale: cs

Task: ASR

Format: MP3

Size: 5.56 GB

Common Voice

Common Voice Scripted Speech 25.0 - Urdu

A collection of read speech recordings in Urdu.
License: CC0-1.0

Locale: ur

Task: ASR

Format: MP3

Size: 5.78 GB

Common Voice

Common Voice Scripted Speech 25.0 - Georgian

A collection of read speech recordings in Georgian.
License: CC0-1.0

Locale: ka

Task: ASR

Format: MP3

Size: 6.37 GB

Common Voice

Common Voice Scripted Speech 25.0 - Thai

A collection of read speech recordings in Thai.
License: CC0-1.0

Locale: th

Task: ASR

Format: MP3

Size: 8.38 GB

Common Voice

Common Voice Scripted Speech 25.0 - Russian

A collection of read speech recordings in Russian.
License: CC0-1.0

Locale: ru

Task: ASR

Format: MP3

Size: 6.55 GB

Common Voice

Common Voice Scripted Speech 25.0 - Italian

A collection of read speech recordings in Italian.
License: CC0-1.0

Locale: it

Task: ASR

Format: MP3

Size: 9.71 GB

Common Voice

Common Voice Scripted Speech 25.0 - Galician

A collection of read speech recordings in Galician.
License: CC0-1.0

Locale: gl

Task: ASR

Format: MP3

Size: 7.81 GB

Common Voice

Common Voice Scripted Speech 25.0 - Latvian

A collection of read speech recordings in Latvian.
License: CC0-1.0

Locale: lv

Task: ASR

Format: MP3

Size: 5.84 GB

Common Voice

Common Voice Scripted Speech 25.0 - Persian

A collection of read speech recordings in Persian.
License: CC0-1.0

Locale: fa

Task: ASR

Format: MP3

Size: 10.40 GB

Common Voice

Common Voice Scripted Speech 25.0 - Tamil

A collection of read speech recordings in Tamil.
License: CC0-1.0

Locale: ta

Task: ASR

Format: MP3

Size: 8.57 GB

Common Voice

Common Voice Scripted Speech 25.0 - Uyghur

A collection of read speech recordings in Uyghur.
License: CC0-1.0

Locale: ug

Task: ASR

Format: MP3

Size: 9.69 GB

Common Voice

Common Voice Scripted Speech 25.0 - Kabardian

A collection of read speech recordings in Kabardian.
License: CC0-1.0

Locale: kbd

Task: ASR

Format: MP3

Size: 5.52 GB

Common Voice

Common Voice Scripted Speech 25.0 - Frisian

A collection of read speech recordings in Frisian.
License: CC0-1.0

Locale: fy-NL

Task: ASR

Format: MP3

Size: 4.34 GB

Common Voice

Common Voice Scripted Speech 25.0 - Welsh

A collection of read speech recordings in Welsh.
License: CC0-1.0

Locale: cy

Task: ASR

Format: MP3

Size: 3.89 GB

Common Voice

Common Voice Scripted Speech 25.0 - Central Kurdish

A collection of read speech recordings in Central Kurdish.
License: CC0-1.0

Locale: ckb

Task: ASR

Format: MP3

Size: 3.59 GB

Common Voice

Common Voice Scripted Speech 25.0 - Hungarian

A collection of read speech recordings in Hungarian.
License: CC0-1.0

Locale: hu

Task: ASR

Format: MP3

Size: 3.58 GB