Synthetic Ladino Parallel Corpus

This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean" carried out by Col·lectivaT and Sephardic Center of Istanbul. The parallel data was synthetically generated using the Espanyol-Ladino rule-based translation engine. Each language pair is provided as a separate TSV file with columns: id, source language text, Ladino text, corpus_id, and src_2.
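As a minimal sketch of reading one of these TSV files, the snippet below extracts (source, Ladino) sentence pairs by column position, following the column order listed above. The function name `read_pairs`, the `has_header` flag, and the header labels in the sample are assumptions for illustration; check whether the actual files include a header row and adjust accordingly. The sample rows are placeholders, not real corpus data.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str], has_header: bool = True) -> Iterator[Tuple[str, str]]:
    """Yield (source_text, ladino_text) pairs from tab-separated lines.

    Assumes the column order: id, source text, Ladino text, corpus_id, src_2.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if present (assumption)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        yield row[1], row[2]


# Placeholder content standing in for a real file; header labels are hypothetical.
sample = io.StringIO(
    "id\tsource\tladino\tcorpus_id\tsrc_2\n"
    "0\t<source sentence>\t<ladino sentence>\tdemo\t\n"
)
pairs = list(read_pairs(sample))
```

With a real file, the same function can be applied to an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points to one of the language-pair TSV files.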

Note: Since this data is synthetically generated via rule-based translation, it does not correspond to actual Ladino as used by native speakers. It is intended for training purposes only and is not recommended for evaluation.

Translation models trained with this data are available at collectivat/ladino-MT-models, and a live translation application is available at translate.sefarad.com.tr.

For other datasets resulting from this project, visit the Ladino Data Hub.

Citation

If you use this data, please cite:

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

@article{Oktem2022,
  author = {Öktem, Alp and Zevallos, Rodolfo and Moslem, Yasmin and Öztürk, Güneş and Şarhon, Karen},
  title = {Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish},
  journal = {arXiv preprint arXiv:2205.15599},
  year = {2022},
  doi = {10.48550/arXiv.2205.15599}
}

Disclaimer

This dataset was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean" carried out by Col·lectivaT and Sephardic Center of Istanbul within the framework of the "Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)" implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this repository is the sole responsibility of Col·lectivaT and does not necessarily reflect the views of the European Union.

Please check README.md for more information.