Datasets
RFE/RL Ukrainian News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: uk, ru
Task: NLP
Format: TXT
Size: 591.97 MB
Synthetic Text Corpus for African Language ASR
License: CC-BY-NC-4.0
Locale: bm, ny, ha, kr, luo
Task: NLP
Format: TSV
Size: 746.63 KB
Kaler Kantho Bengali Newspaper Corpus
License: CC-BY-NC-4.0
Locale: ben
Task: NLP
Format: DOCX
Size: 33.11 MB
Marma Text Corpus
License: CC-BY-NC-SA-4.0
Locale: rmz
Task: LM
Format: TSV
Size: 188.92 KB
Prothom Alo Bengali Newspaper Corpus
License: CC-BY-NC-4.0
Locale: ben
Task: NLP
Format: DOCX
Size: 42.36 MB
RFE/RL Uzbek News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: uz
Task: NLP
Format: TXT
Size: 154.21 MB
RFE/RL Romanian (Romania) News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: ro
Task: NLP
Format: TXT
Size: 77.95 MB
Hindi 10 Million Text Corpus
License: CC-BY-ND-4.0
Locale: hin
Task: NLP
Format: DOCX
Size: 21.91 MB
The Daily Jugantor Bengali Language Corpus
License: CC-BY-NC-SA-4.0
Locale: ben
Task: NLP
Format: DOCX
Size: 40.49 MB
CV Korean Test 25.0 - Noise-Augmented (SCAI)
License: CC0-1.0
Locale: ko
Task: ASR
Format: MP3, JSONL
Size: 21.01 MB
IBT Torwali Literature Corpus
License: CC-BY-NC-4.0
Locale: trw
Task: NLP
Format: TXT
Size: 488.12 KB
Bulu_ALCAM-MultimodalDataset
License: NOODL-1.0
Locale: bum
Task: NLP
Format: MP3, TSV
Size: 31.28 MB
Hausa-TTS-Dataset
License: NOODL-1.0
Locale: hau
Task: TTS
Format: MP3, TSV
Size: 276.90 MB
Tamil Time Aligned Speech Dataset
License: CC-BY-NC-SA-4.0
Locale: tam
Task: ASR
Format: OGG, SRT
Size: 37.11 MB
ViQua² — Visual Question-answering about Quantities
License: CC-BY-SA-4.0
Locale: en-US
Task: CV
Format: JSON, JPEG
Size: 281.05 MB
Bamun-TTS-Dataset
License: NOODL-1.0
Locale: bax
Task: TTS
Format: MP3, TSV
Size: 219.97 MB
Territórios Digitais
License: CC-BY-4.0
Locale: pt, en
Task: N/A
Format: DOCX, PDF, XLSX
Size: 4.24 MB
Chuvash TTS
License: CC-BY-SA-4.0
Locale: cv
Task: TTS
Format: PARQUET
Size: 854.02 MB
RFE/RL Persian News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: fa
Task: NLP
Format: TXT
Size: 307.78 MB
Saraiki 10 Hours TTS Dataset
License: CC-BY-NC-SA-4.0
Locale: skr
Task: TTS
Format: WEBM, TSV
Size: 584.44 MB
Kannada Time Aligned Speech Corpus
License: CC-BY-NC-SA-4.0
Locale: kan
Task: ASR
Format: OGG, SRT
Size: 355.77 MB
Sentence translation difficulty in Spanish - BOUQuET
License: CC-BY-SA-4.0
Locale: es
Task: MT
Format: TSV
Size: 81.48 KB
Yezoum_ALCAM-MultimodalDataset
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 12.81 MB
Common Voice Spontaneous Speech 3.0 - Serian Bidayuh
License: CC0-1.0
Locale: sdo
Task: ASR
Format: MP3
Size: 201.26 MB