RFE/RL Macedonian News Text Corpus

License:

CC-BY-NC-SA-4.0

Steward:

RFE/RL

Task: NLP

Release Date: 4/14/2026

Format: TXT

Size: 133.95 MB


Description

This dataset is a longitudinal news corpus for the Macedonian language, sourced from Radio Slobodna Evropa (slobodnaevropa.mk), the Macedonian service of Radio Free Europe/Radio Liberty (RFE/RL). It spans May 2002 to March 2026 and contains 204,934 unique articles totaling over 46 million tokens, covering more than two decades of the region's news record and of change in the written language. Automated language detection struggles with closely related Cyrillic-script Slavic languages, so some valid Macedonian articles were tagged as Serbian or Bulgarian; these were manually inspected and intentionally kept in this unified dataset. The file is plain text with YAML front-matter metadata, suitable for linguistic analysis, search indexing, and cultural preservation research.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Considerations

Restrictions/Special Constraints

When using RFE/RL text content in full, we require that you credit RFE/RL by including:

  • A permanent link, placed before the text of the article, to the original article on www.rferl.org or www.slobodnaevropa.mk

  • The following text in the article: Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty

You must also refrain from altering or distorting the meaning, name, or integrity of the product.

When using excerpts of RFE/RL text content, we require that you note that the material is an excerpt and link to the original content somewhere in your text.

When translating RFE/RL text content into another language, we require that you note that the material is a translation, state the original language of the text, and provide a link to the original content somewhere in your text. The translated text must not alter or distort the meaning, name, or integrity of the content.

Forbidden Usage

  • The sale of RFE/RL content is prohibited.

  • The use of RFE/RL content in advertisements or endorsements is prohibited.

  • The use of RFE/RL content to train artificial intelligence (AI) systems is prohibited.

Metadata

RFE/RL Macedonian News Text Corpus (2002–2026)

Overview

This corpus was extracted from the archives of Radio Slobodna Evropa (slobodnaevropa.mk), the Macedonian Service of Radio Free Europe/Radio Liberty.

Statistics

  • Total Articles: 204,934

  • Time Period: 2002-05 to 2026-03

  • Languages:

    • Macedonian (mk): 204,934 articles (~46,365,739 tokens)

Note on Processing:

  • Language Detection: The language of each article was identified using the pycld2 Python package (version 0.42). Because Macedonian shares significant vocabulary and the Cyrillic script with neighboring Slavic languages, pycld2 frequently misclassified valid Macedonian articles as Serbian (sr) or Bulgarian (bg). These articles were intentionally merged back into the single mk file for this dataset, since manual inspection confirmed they are standard Macedonian. The lang field in each article's YAML front matter still shows the original CLD2 tag.

  • Paragraph Structure: Paragraph breaks from the original HTML were preserved to the extent possible.

  • Formatting: Text has been wrapped at 80 characters for easier inspection in terminal environments. This wrapping is done strictly on whitespace; no words were split or chunked apart.
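
For analyses that need whole paragraphs rather than 80-character lines, the wrapping can be undone by rejoining lines. The sketch below is a minimal illustration, not part of the dataset tooling. It assumes that paragraphs are separated by blank lines, which the notes above do not state explicitly; the function name unwrap_paragraphs and the sample sentences are hypothetical.

```python
import textwrap


def unwrap_paragraphs(text: str) -> list[str]:
    """Rejoin hard-wrapped lines into one string per paragraph.

    Assumption: a blank line marks a paragraph break.
    """
    paragraphs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Simulate the described wrapping: 80 columns, breaking only on whitespace.
p1 = ("Ова е пример за подолг пасус, кој е прекршен на осумдесет знаци "
      "само на празните места, без делење на зборовите.")
p2 = "Втор, пократок пасус."
wrapped = "\n\n".join(
    textwrap.fill(p, width=80, break_long_words=False, break_on_hyphens=False)
    for p in (p1, p2)
)

restored = unwrap_paragraphs(wrapped)
```

Because the wrapping breaks only on whitespace, joining the lines of a paragraph with single spaces should recover the running text, up to any whitespace that was normalized when the file was produced.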

Data Format

The dataset is provided as a text file:

  • slobodnaevropa.mk.txt

Inside the file, each article begins with a YAML front matter block containing its metadata, followed by the full article text.

Metadata Fields

  • url: The canonical URL of the original article.

  • title: The headline of the article.

  • date: Publication date (ISO 8601 format: YYYY-MM-DD).

  • script: The writing system used (cyrl for Cyrillic).

  • lang: The detected language code (often sr or bg due to CLD2 misclassification, but the text is Macedonian).
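
A minimal sketch of reading this layout follows. It rests on assumptions not confirmed by this description: that each front matter block is opened and closed by a line containing only `---`, that every field is a simple one-line `key: value` pair, and that no article body contains a bare `---` line. The function name iter_articles and the sample URLs and text are hypothetical. A full YAML parser would be more robust for titles with unusual quoting.

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per article: front-matter fields plus a 'text' key."""
    meta = None            # fields of the article being read, None before the first one
    body: list[str] = []
    in_front_matter = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "---":          # assumed delimiter
            if in_front_matter:
                in_front_matter = False    # closing delimiter: body starts
            else:
                if meta is not None:       # opening delimiter: emit previous article
                    yield {**meta, "text": "\n".join(body).strip()}
                meta, body, in_front_matter = {}, [], True
            continue
        if in_front_matter:
            # Split on the first colon only, so URLs keep their "https:" part.
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip().strip("'\"")
        elif meta is not None:
            body.append(line)
    if meta is not None and not in_front_matter:
        yield {**meta, "text": "\n".join(body).strip()}


# Hypothetical two-article sample in the assumed layout.
sample = """---
url: https://www.slobodnaevropa.mk/a/example.html
title: Пример наслов
date: 2010-01-15
script: cyrl
lang: sr
---
Прв пасус на статијата.

Втор пасус.
---
url: https://www.slobodnaevropa.mk/a/example-2.html
title: Втор пример
date: 2011-02-20
script: cyrl
lang: mk
---
Текст на втората статија.
"""

articles = list(iter_articles(sample.splitlines()))
# Tally the CLD2 tags; sr and bg entries are still Macedonian text.
lang_counts = Counter(a["lang"] for a in articles)
```

Since the function is a generator that accepts any iterable of lines, it could be fed an open handle on slobodnaevropa.mk.txt (opened with encoding="utf-8", itself an assumption) to stream the corpus without loading all 133.95 MB into memory.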

Source & License

All content is the property of RFE/RL, Inc. and is protected by U.S. and international copyright laws.

Users of this dataset must adhere to the RFE/RL Terms of Use. Specifically, users must credit RFE/RL in any reuse:

Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty.