License:
CC-BY-4.0
Steward:
Community
Task: LM
Release Date: 4/16/2026
Format: TXT
Size: 403.16 KB
Şalom Ladino Articles is a monolingual text corpus in Ladino (Judeo-Spanish), compiled from 397 articles published in the Judeo-Espanyol section of Şalom newspaper. The corpus contains 176,843 words, provided as a single segmented and shuffled plain-text file. It was created by Col·lectivaT and the Sephardic Center of Istanbul as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", funded by the European Union and the Ministry of Culture and Tourism of the Republic of Turkey under the CCH-II grant scheme. The corpus supports NLP research and language technology development for this endangered language.
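Because the corpus ships as a single plain-text file, a few lines of Python are enough to load it and check its size. The sketch below is illustrative only: it assumes the file is UTF-8 encoded with one segmented sentence per line, and the helper names `load_sentences` and `corpus_stats` are hypothetical, not part of the release. Words are counted by whitespace splitting, so the total may differ from the 176,843 figure above if a different tokenization was used.

```python
from pathlib import Path


def load_sentences(path):
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace.

    Assumption: the corpus file stores one segmented sentence per line.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def corpus_stats(sentences):
    """Count sentences and whitespace-separated words.

    Assumption: whitespace tokenization; the published word count may
    have been computed with a different tokenizer.
    """
    n_words = sum(len(sentence.split()) for sentence in sentences)
    return {"sentences": len(sentences), "words": n_words}
```

To use it, pass the path of the downloaded file to `load_sentences` and hand the result to `corpus_stats`. Since the sentences are already shuffled, a held-out set for language modeling can be taken as a simple slice of the returned list.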
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Attribution required. Users must credit the original source (Şalom newspaper) and cite the associated paper when using this dataset.
Forbidden Usage
Use without attribution to the original creators and source (Şalom newspaper) is not permitted.
Ethical Review
The data was collected from copyrighted newspaper articles published by Şalom. Şalom granted permission, through SKAD, to compile and publish the text in the form of randomized sentences.
Intended Use
Language modeling, text generation, machine translation training, and NLP research for Ladino (Judeo-Spanish), an endangered language.
This corpus was compiled as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean" by Col·lectivaT and the Sephardic Center of Istanbul. The data originates from the Judeo-Espanyol section of Şalom newspaper.
For more datasets published within this initiative, see the Ladino Data Hub.
If you use this data, please cite:
Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish
@inproceedings{oktem-etal-2022-preparing,
title = "Preparing an endangered language for the digital age: The Case of {J}udeo-{S}panish",
author = {{\"{O}}ktem, Alp and
Zevallos, Rodolfo and
Moslem, Yasmin and
{\"{O}}zt{\"u}rk, {\"{O}}zg{\"u}r G{\"u}ne{\c{s}} and
Gerson {\c{S}}arhon, Karen},
editor = "Ojha, Atul Kr. and
Ahmadi, Sina and
Liu, Chao-Hong and
McCrae, John P.",
booktitle = "Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.eurali-1.18/",
pages = "105--110",
}
This repository was developed as part of the project "Judeo-Spanish: Connecting the two ends of the Mediterranean", carried out by Col·lectivaT and the Sephardic Center of Istanbul within the framework of the "Grant Scheme for Common Cultural Heritage: Preservation and Dialogue between Turkey and the EU–II (CCH-II)", implemented by the Ministry of Culture and Tourism of the Republic of Turkey with the financial support of the European Union. The content of this website is the sole responsibility of Col·lectivaT and does not necessarily reflect the views of the European Union.
Please check README.md for more information.