Ladino TTS Corpus

Description

This dataset is a single-speaker Ladino speech corpus recorded by Karen Şarhon, a native speaker from Istanbul. Ladino (also called Judeo-Spanish or Judezmo, ISO 639-3: lad) descends from 15th-century Castilian Spanish, essentially the Spanish spoken at the time of the 1492 expulsion of the Sephardic Jews. It evolved separately for more than five centuries in the Ottoman Empire and is today classified as severely endangered by UNESCO.

The corpus was built by having the speaker read 30 articles from the weekly newspaper El Amaneser (the only newspaper published entirely in Judeo-Spanish), covering historical issues, current affairs, cultural events and politics. The recordings were automatically aligned using a Coqui Speech-to-Text Spanish model, manually verified, and then segmented into 1,987 clips totalling approximately 2 hours 15 minutes of speech, each paired with its transcription. The corpus was created for training text-to-speech synthesis models.

The package includes WAV audio files (16 kHz, 16-bit, mono) under clips/train/ and a metadata.tsv file covering all 1,987 entries, with the columns path, split, audio_id, text, and duration.
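As a quick sanity check after downloading, the metadata columns and audio format described above can be inspected with the Python standard library alone. The sketch below is illustrative only and rests on assumptions not stated in this description: that metadata.tsv is tab-separated with a header row, and that the duration column is expressed in seconds. The sample rows and file names in it are invented placeholders, not actual corpus entries.

```python
import csv
import io
import wave
from collections import Counter


def load_metadata(tsv_file):
    """Read metadata rows from an open text file into a list of dicts.

    Assumption: tab-separated, first row is the header
    (path, split, audio_id, text, duration).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def summarize(rows):
    """Return (clip count, total duration, clips per split).

    Assumption: the duration column holds seconds as a decimal number.
    """
    total_seconds = sum(float(row["duration"]) for row in rows)
    per_split = Counter(row["split"] for row in rows)
    return len(rows), total_seconds, dict(per_split)


def wav_matches_spec(wav_path, rate=16000, sample_width_bytes=2, channels=1):
    """Check that a WAV file is 16 kHz, 16-bit (2 bytes per sample), mono."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
            and wav.getnchannels() == channels
        )


# Two invented rows standing in for metadata.tsv; real values will differ.
SAMPLE_TSV = (
    "path\tsplit\taudio_id\ttext\tduration\n"
    "clips/train/sample_0001.wav\ttrain\tsample_0001\tplaceholder text\t4.5\n"
    "clips/train/sample_0002.wav\ttrain\tsample_0002\tplaceholder text\t5.5\n"
)

if __name__ == "__main__":
    rows = load_metadata(io.StringIO(SAMPLE_TSV))
    count, seconds, per_split = summarize(rows)
    print(f"{count} clips, {seconds / 60:.2f} minutes, splits: {per_split}")
```

To run this against the real file, replace the in-memory sample with open("metadata.tsv", encoding="utf-8", newline="") and pass each row's path value to wav_matches_spec. If the assumptions hold, the clip count should come to 1,987 and the total duration to roughly 2 hours 15 minutes.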

The dataset was structured and curated by Col·lectivaT as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out together with the Sephardic Center of Istanbul.

A Glow-TTS speech synthesis model trained on this dataset is available at collectivat/ladino-tts.

HuggingFace dataset: https://huggingface.co/datasets/collectivat/ladino-karen-TTS

For other datasets resulting from this project, visit the Ladino Data Hub.

Citation

If you use this dataset, please cite:

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

Disclaimer

This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out by Col·lectivaT and the Sephardic Center of Istanbul within the framework of the "Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)", implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this dataset is the sole responsibility of Col·lectivaT and the Sephardic Center of Istanbul and does not necessarily reflect the views of the European Union.

Please check README.md for more information.
