License:
CC-BY-NC-4.0
Steward:
Community
Dataset ID:
cmqz8qad200hzmp070fvxru9n
Task: TTS
Release Date: 6/29/2026
Format: WAV, TSV
Size: 196.00 MB
This dataset is a single-speaker speech corpus in Ladino, recorded by Karen Şarhon, a native speaker from Istanbul. Ladino (also called Judeo-Spanish or Judezmo, ISO 639-3: lad) descends from 15th-century Castilian Spanish, essentially the medieval Spanish spoken at the time of the 1492 expulsion of the Sephardic Jews. It evolved separately for over 530 years in the Ottoman Empire and is today classified as severely endangered by UNESCO. The corpus was built by having the native speaker read 30 articles from the weekly newspaper El Amaneser (the only newspaper published entirely in Judeo-Spanish), covering historical issues, current affairs, cultural events and politics. The recordings were automatically aligned and manually verified, then segmented into 1,987 clips (16 kHz, 16-bit, mono WAV) totalling approximately 2 hours 15 minutes of speech, each paired with its transcription. The corpus was created for training text-to-speech synthesis models and was structured and curated by Col·lectivaT as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", funded by the European Union via the CCH-II Grant Scheme.
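As a rough illustration of the audio format stated above, the following Python sketch checks whether a single clip is 16 kHz, 16-bit, mono, using only the standard-library wave module. The function name clip_matches_format is hypothetical, and the sketch assumes the clips are uncompressed PCM WAV files, which the description implies but does not state outright.

```python
import wave

# Format stated in the dataset description: 16 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1


def clip_matches_format(wav_path):
    """Return (matches, duration_seconds) for one PCM WAV file.

    Hypothetical helper: `matches` is True only when the sample rate,
    sample width and channel count all equal the expected values.
    """
    with wave.open(str(wav_path), "rb") as clip:
        matches = (
            clip.getframerate() == EXPECTED_RATE_HZ
            and clip.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and clip.getnchannels() == EXPECTED_CHANNELS
        )
        # Duration follows from the frame count and the sample rate.
        duration = clip.getnframes() / clip.getframerate()
    return matches, duration
```

Running this over every clip would be one way to confirm the stated format before training; a clip that returns False would need resampling or closer inspection.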
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
Non-commercial research and educational use only, with attribution to the original creators (Col·lectivaT and Sephardic Center of Istanbul).
Forbidden Usage
No commercial use. No voice impersonation, voice cloning, or deepfake creation. No use that could harm the speaker or the Ladino community. No use without attribution. Any synthetic speech generated from models trained on this data must be clearly identified as synthetic.
Ethical Review
Use is restricted to non-commercial purposes; voice impersonation and deepfake creation are forbidden, and any synthetic speech generated must be clearly identified as synthetic.
Intended Use
Training and evaluation of text-to-speech (TTS) synthesis models for Ladino; automatic speech recognition for Ladino; speech technology research and documentation supporting the revitalisation of an endangered language.
This dataset is a single-speaker Ladino speech corpus recorded by Karen Şarhon, a native speaker from Istanbul. Ladino (Judeo-Spanish / Judezmo, ISO 639-3: lad) is a descendant of the old Castilian Spanish of the 15th century that evolved separately for over five centuries after the 1492 expulsion of the Sephardic Jews. The native speaker read 30 articles from the weekly newspaper El Amaneser; the recordings were automatically aligned (using a Coqui Speech-to-Text Spanish model) and manually verified, then segmented into 1,987 clips totalling approximately 2 hours 15 minutes of speech, each paired with its transcription. The package includes WAV audio files (16 kHz, 16-bit, mono) under clips/train/ and a metadata.tsv covering all 1,987 entries with columns: path, split, audio_id, text, and duration.
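The metadata layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes metadata.tsv has a header row containing the five listed column names and that duration is a number expressed in seconds (neither detail is confirmed here, so check README.md). The function name summarize_metadata is hypothetical.

```python
import csv
from pathlib import Path

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["path", "split", "audio_id", "text", "duration"]


def summarize_metadata(tsv_path):
    """Return (number_of_clips, total_duration) for a metadata TSV file.

    Assumes a header row and a numeric `duration` column
    (unit assumed to be seconds; verify against README.md).
    """
    with Path(tsv_path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters in transcriptions as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"metadata file is missing columns: {missing}")
        rows = list(reader)
    total = sum(float(row["duration"]) for row in rows)
    return len(rows), total
```

If these assumptions hold, calling summarize_metadata("metadata.tsv") on the full package should report 1,987 clips and a total close to 2 hours 15 minutes (about 8,100 seconds).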
The dataset was structured and curated by Col·lectivaT as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out together with the Sephardic Center of Istanbul.
A Glow-TTS speech synthesis model trained on this dataset is available at collectivat/ladino-tts.
HuggingFace dataset: https://huggingface.co/datasets/collectivat/ladino-karen-TTS
For other datasets resulting from this project visit the Ladino Data Hub.
If you use this dataset, please cite:
Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish
This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out by Col·lectivaT and the Sephardic Center of Istanbul within the framework of the "Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)", implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this dataset is the sole responsibility of Col·lectivaT and the Sephardic Center of Istanbul and does not necessarily reflect the views of the European Union.
Please check README.md for more information.