Task: NLP
Release Date: 4/15/2026
Format: TXT
Size: 310.39 MB
This dataset serves as a comprehensive longitudinal news corpus for the Serbian, Bosnian, and Montenegrin languages, sourced from Radio Slobodna Evropa (slobodnaevropa.org), the Balkan service of Radio Free Europe/Radio Liberty (RFE/RL). Spanning from December 2003 to March 2026, the corpus contains 389,883 unique articles, totaling over 24 million tokens. Due to the unified nature of this regional desk covering Serbia, Bosnia and Herzegovina, and Montenegro, articles originally detected as Serbian, Croatian, or Bosnian have been intentionally aggregated into a single combined dataset. The corpus also preserves a small subset of English articles published alongside the regional content. The files are formatted as plain text with YAML front-matter metadata, making them ready for linguistic analysis, search indexing, and cultural preservation research.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
When using RFE/RL text content in full, we require that you credit RFE/RL by including:
• A permanent link, placed before the text of the article, to the original article on www.rferl.org or www.slobodnaevropa.org/
• The following text in the article: Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty
You must also refrain from altering or distorting the meaning, name, or integrity of the product.
When using excerpts of RFE/RL text content, we require that you note that the material is an excerpt and link to the original content somewhere in your text.
When translating RFE/RL text content into another language, we require that you note that the material is a translation, state the original language of the text, and provide a link to the original content somewhere in your text. The translated text must not alter or distort the meaning, name, or integrity of the content.
Forbidden Usage
• The sale of RFE/RL content is prohibited.
• The use of RFE/RL content in advertisements or endorsements is prohibited.
• The use of RFE/RL content to train artificial intelligence (AI) systems is prohibited.
This corpus was extracted from the archives of Radio Slobodna Evropa (slobodnaevropa.org), the Balkan Service of Radio Free Europe/Radio Liberty covering Serbia, Bosnia and Herzegovina, and Montenegro.
Total Articles: 389883
Time Period: 2003-12 to 2026-03
Languages:
Serbian/Bosnian/Montenegrin (sh): 389749 articles (~24281317 tokens)
English (en): 134 articles (~135515 tokens)
Note on Processing & Multilingual Content:
Language Detection: The language of each article was identified
using the pycld2 Python package. Because this RFE/RL bureau operates
across Serbia, Bosnia and Herzegovina, and Montenegro, pycld2 tagged
articles variously as Serbian (sr), Croatian (hr), or Bosnian (bs).
These have been intentionally aggregated into a single sh file for
this dataset to reflect the unified nature of the desk. The YAML
front matter of each article still carries the original CLD2 tag.
English Articles: Any English articles published alongside the
regional content have been preserved (slobodnaevropa.en.txt).
Paragraph Structure: Paragraph breaks from the original HTML were preserved to the extent possible.
Formatting: Text has been wrapped at 80 characters for easier inspection in terminal environments. This wrapping is done strictly on whitespace; no words were split or chunked apart.
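For analysis it is often easier to work with one string per paragraph than with 80-character lines. The sketch below assumes that paragraphs are separated by blank lines, which the notes above do not state explicitly; the function name unwrap_paragraphs and the sample sentence are illustrative only. The textwrap call is there to imitate whitespace-only wrapping for the demonstration and is not necessarily the tool used to build the corpus.

```python
import re
import textwrap
from typing import List


def unwrap_paragraphs(body: str) -> List[str]:
    """Join hard-wrapped lines back into one string per paragraph.

    Assumes paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", body.strip())
    # Collapse every run of whitespace (including newlines) to one space.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


# Illustrative text, wrapped on whitespace only (no word is split).
original = (
    "Ovo je primjer dugog pasusa koji se prelama na vise redova kako bi "
    "se pokazalo da se rijeci ne dijele prilikom prelamanja teksta."
)
wrapped = textwrap.fill(
    original, width=80, break_long_words=False, break_on_hyphens=False
)
restored = unwrap_paragraphs(wrapped + "\n\nDrugi pasus.")
```

Because the wrapping happens only on whitespace, joining the lines with single spaces recovers the paragraph text, up to any runs of multiple spaces in the original.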
The dataset is provided as text files based on detected language:
slobodnaevropa.sh.txt
slobodnaevropa.en.txt
Inside the files, each article is delimited by a YAML Front Matter block containing metadata, followed by the full article text.
url: The canonical URL of the original article.
title: The headline of the article.
date: Publication date (ISO 8601 format: YYYY-MM-DD).
script: The writing system used (latn or cyrl).
lang: The detected language code (often sr, hr, or bs).
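A minimal sketch of reading this layout in Python follows. It assumes that each front matter block is fenced by lines containing only "---" and holds simple "key: value" lines; the description above does not spell out the delimiter, so check it against the actual files. The function name iter_articles, the sample URLs, and the sample text are made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterator, Tuple

# Assumed layout: "---", one or more "key: value" lines, "---", then the body.
FRONT_MATTER = re.compile(r"^---\n((?:[a-z_]+:.*\n)+)---\n", re.MULTILINE)


def iter_articles(text: str) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (metadata, body) pairs from the contents of one corpus file."""
    matches = list(FRONT_MATTER.finditer(text))
    for i, m in enumerate(matches):
        meta: Dict[str, str] = {}
        for line in m.group(1).splitlines():
            # Split on the first colon only, so URLs and titles stay intact.
            key, _, value = line.partition(":")
            value = value.strip()
            # Drop one pair of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            meta[key.strip()] = value
        # The body runs until the next front matter block or end of file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield meta, text[m.end():end].strip()


# Made-up sample in the assumed layout.
SAMPLE = """---
url: https://www.slobodnaevropa.org/a/example-1.html
title: Primjer: prvi naslov
date: 2004-01-15
script: latn
lang: bs
---
Prvi pasus teksta.

Drugi pasus teksta.
---
url: https://www.slobodnaevropa.org/a/example-2.html
title: Drugi naslov
date: 2004-01-16
script: cyrl
lang: sr
---
Текст другог чланка.
"""

articles = list(iter_articles(SAMPLE))
lang_counts = Counter(meta["lang"] for meta, _ in articles)
```

Counting the lang field this way shows how the combined sh file breaks down by original CLD2 tag. The sketch handles the file contents as a single string; for the full sh file a streaming reader may be preferable.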
All content is the property of RFE/RL, Inc. and is protected by U.S. and international copyright laws.
Users of this dataset must adhere to the RFE/RL Terms of Use. Specifically, users must credit RFE/RL in any reuse:
Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty.