License:
CC-BY-4.0
Steward:
Community
Dataset ID:
cmpbmhj4i0067nw07tk46v2jp
Task: MT
Release Date: 5/18/2026
Format: TSV
Size: 898.32 MB
This dataset contains over 20 million synthetic parallel sentence pairs for Ladino (Judeo-Spanish) paired with English (5.7M pairs), Spanish (10.3M pairs), and Turkish (4.6M pairs). The data was generated using rule-based Spanish-Ladino translation methods to support the preservation and digital development of this endangered language. Created by Col·lectivaT as part of the "Judeo-Spanish: Connecting the two ends of the Mediterranean" project.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Attribution must be given to the original creators.
Forbidden Usage
Use without proper attribution to the original creators.
Ethical Review
The data is synthetically generated using rule-based translation methods and not verified by Ladino linguists.
Intended Use
Training machine translation models for Ladino, an endangered language. Since the data is synthetically generated, it is not recommended for evaluation purposes.
This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean" carried out by Col·lectivaT and the Sephardic Center of Istanbul. The parallel data was synthetically generated using the Espanyol-Ladino rule-based translation engine. Each language pair is provided as a separate TSV file with five columns: id, source language text, Ladino text, corpus_id, and src_2.
Note: Since this data is synthetically generated via rule-based translation, it does not correspond to actual Ladino as used by native speakers. It is intended for training purposes only and is not recommended for evaluation.
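As a rough illustration of the five-column layout described above, the following Python sketch reads one of the TSV files into named records. It rests on several assumptions that this card does not confirm: that the columns appear in the listed order, that the field names `source_text` and `ladino_text` are acceptable stand-ins for the two text columns, and that a header row may or may not be present (hence the hypothetical `skip_header` flag). The sample rows are made-up placeholders, not real dataset content.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Pair(NamedTuple):
    """One row of a language-pair file (field names are illustrative)."""
    id: str
    source_text: str   # English, Spanish, or Turkish sentence
    ladino_text: str   # synthetic Ladino sentence
    corpus_id: str
    src_2: str


def read_pairs(lines: Iterable[str], skip_header: bool = False) -> Iterator[Pair]:
    """Yield Pair records from tab-separated lines, skipping malformed rows."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if skip_header:
        next(reader, None)
    for row in reader:
        if len(row) != 5:
            continue  # assumption: rows without exactly five fields are unusable
        yield Pair(*row)


# Placeholder data standing in for a real file; the second line is malformed.
sample = (
    "1\tsource sentence one\tladino sentence one\tcorpus-a\tx\n"
    "broken\trow\n"
    "2\tsource sentence two\tladino sentence two\tcorpus-b\ty\n"
)
pairs = list(read_pairs(io.StringIO(sample)))
print(len(pairs), pairs[0].ladino_text)
```

For a file on disk, the same function can be given an open handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points to whichever language-pair file was downloaded.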
Translation models trained with this data are available at collectivat/ladino-MT-models, and a live translation application is available at translate.sefarad.com.tr.
For other datasets resulting from this project visit Ladino Data Hub.
If you use this data, please cite:
Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish
@article{Oktem2022,
author = {Öktem, Alp and Zevallos, Rodolfo and Moslem, Yasmin and Öztürk, Güneş and Şarhon, Karen},
title = {Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish},
journal = {arXiv preprint arXiv:2205.15599},
year = {2022},
doi = {10.48550/arXiv.2205.15599}
}
This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean" carried out by Col·lectivaT and the Sephardic Center of Istanbul within the framework of the "Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)" implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this repository is the sole responsibility of Col·lectivaT and does not necessarily reflect the views of the European Union.
Please check README.md for more information.