License: CC-BY-4.0
Steward: Community
Task: NLP
Release Date: 4/16/2026
Format: OGG, JPEG, TSV
Size: 76.35 MB
"Una fraza al diya" (A Phrase a Day) is a Ladino language learning dataset prepared by Karen Sarhon of the Sephardic Center of Istanbul (SKAD). It consists of 307 sentences in Ladino (Judeo-Spanish) with parallel translations in Spanish, Turkish, and English. The sentences and images were originally published on SKAD's Instagram account (@sephardiccenteristanbul) and extracted using OCR. Audio recordings come from the accompanying web initiative (https://sefarad.com.tr/judeo-espanyolladino/frazadeldia/). The dataset was structured by Col·lectivaT as part of a project to support Ladino in the digital age.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Free to use for any purpose (commercial and non-commercial) with attribution to the original creators. Users must credit Karen Sarhon (Sephardic Center of Istanbul) and Col·lectivaT.
Forbidden Usage
Use without attribution is not permitted. Redistribution without proper credit to the original creators violates the license. Do not claim ownership of the dataset or its contents. Do not use the audio recordings for speaker identification or voice cloning.
Ethical Review
Data sourced from publicly available Instagram posts by the Sephardic Center of Istanbul with the involvement and consent of Karen Sarhon. The dataset supports the preservation of Ladino, a critically endangered language. No sensitive personal information is included.
Intended Use
Ladino language learning and documentation; machine translation training and evaluation (lad↔es, lad↔tr, lad↔en); automatic speech recognition for Ladino; text-to-speech for endangered language preservation; linguistic research on Judeo-Spanish.
The sentences were originally published on the Instagram account of the Sephardic Center of Istanbul (@sephardiccenteristanbul). Text and images were extracted using OCR. Audio recordings are from the accompanying web initiative. The dataset was structured by Col·lectivaT as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean". The package includes OGG audio files in clips/, JPEG images in images/, and a metadata.tsv covering all 307 entries. Of these, 292 have audio and 304 have images, so 15 entries are missing audio and 3 are missing images. See README.md for more information.
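As a rough illustration of how the package layout described above could be checked after download, the following Python sketch reads metadata.tsv and counts how many entries reference an audio clip or an image. The column names `audio_file` and `image_file` are hypothetical placeholders, not taken from the dataset. The choice to read the file without quote handling is also an assumption. Check README.md for the actual header and adjust accordingly.

```python
import csv
from pathlib import Path


def summarize_metadata(dataset_dir, audio_col="audio_file", image_col="image_file"):
    """Count entries in metadata.tsv and how many reference audio or images.

    audio_col and image_col are assumed column names (hypothetical);
    replace them with the real header names listed in README.md.
    """
    tsv_path = Path(dataset_dir) / "metadata.tsv"
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # Assumption: fields are plain tab-separated text without CSV quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    # An entry counts as having audio or an image when its cell is non-empty.
    with_audio = sum(1 for row in rows if (row.get(audio_col) or "").strip())
    with_images = sum(1 for row in rows if (row.get(image_col) or "").strip())

    return {
        "entries": len(rows),
        "with_audio": with_audio,
        "with_images": with_images,
        "missing_audio": len(rows) - with_audio,
        "missing_images": len(rows) - with_images,
    }
```

Calling `summarize_metadata("path/to/dataset")` returns a dictionary of counts. If the assumed column names match the real header, those counts can be compared against the figures stated above (307 entries, 292 with audio, 304 with images).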
For more datasets published within this initiative, see the Ladino Data Hub.
If you use this dataset, please cite:
Alp Öktem, Rodolfo Zevallos, Yasmin Moslem, Güneş Öztürk, Karen Şarhon. Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish. Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC 2022. Marseille, France. 20 June 2022.
This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out by Col·lectivaT and the Sephardic Center of Istanbul within the framework of the “Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)”, implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this website is the sole responsibility of Col·lectivaT and does not necessarily reflect the views of the European Union.