RFE/RL Hungarian News Text Corpus
License: CC-BY-NC-SA-4.0
Steward: RFE/RL
Task: NLP
Release Date: 4/13/2026
Format: TXT
Size: 36.64 MB
Description
This dataset is a historical Hungarian-language news corpus sourced from Szabad Európa (szabadeuropa.hu), the Hungarian service of Radio Free Europe/Radio Liberty (RFE/RL). Because the RFE/RL Hungarian service officially ceased operations on November 21, 2025, the dataset is a complete, finalized archive of its modern iteration. Spanning August 2020 to November 2025, the corpus contains 18,494 unique articles totaling over 12.4 million tokens. The file is plain text with YAML front-matter metadata preceding each article, which makes it suitable for linguistic analysis, search indexing, and cultural preservation research.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Considerations
Restrictions/Special Constraints
When using RFE/RL text content in full, we require that you credit RFE/RL by including:
• A permanent link, placed before the text of the article, to the original article on www.rferl.org or www.szabadeuropa.hu
• The following text in the article: Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty
You must also refrain from altering or distorting the meaning, name, or integrity of the product.
When using excerpts of RFE/RL text content, we require that you note that the material is an excerpt and link to the original content somewhere in your text.
When translating RFE/RL text content into another language, we require that you note that the material is a translation, state the original language of the text, and provide a link to the original content somewhere in your text. The translated text must not alter or distort the meaning, name, or integrity of the content.
Forbidden Usage
• The sale of RFE/RL content is prohibited.
• The use of RFE/RL content in advertisements or endorsements is prohibited.
• The use of RFE/RL content to train artificial intelligence (AI) systems is prohibited.
Metadata
RFE/RL Hungarian News Text Corpus (2020–2025)
Overview
This corpus was extracted from the archives of Szabad Európa (szabadeuropa.hu), the Hungarian Service of Radio Free Europe/Radio Liberty.
(Note: The RFE/RL Hungarian service officially ceased operations on November 21, 2025, making this dataset a complete historical archive of its modern iteration.)
Statistics
Total Articles: 18494
Time Period: 2020-08 to 2025-11
Languages:
Hungarian (hu): 18494 articles (~12485502 tokens)
Note on Processing:
Language Detection: The language of each article was identified automatically using the pycld2 Python package (version 0.42).
Paragraph Structure: Paragraph breaks from the original HTML were preserved to the extent possible.
Formatting: Text has been wrapped at 80 characters for easier inspection in terminal environments. This wrapping is done strictly on whitespace; no words were split or chunked apart.
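Because wrapping happens only on whitespace, the 80-character line breaks can be undone without damaging any word. The following is a minimal sketch of one way to do that, assuming that paragraphs in an article body are separated by blank lines (the source does not state the paragraph separator). The function name unwrap_paragraphs is hypothetical.

```python
import re


def unwrap_paragraphs(body: str) -> list[str]:
    """Rejoin hard-wrapped lines into one string per paragraph.

    Assumption (not confirmed by the dataset notes): paragraphs are
    separated by one or more blank lines.
    """
    # Split the body on blank lines to get one block per paragraph.
    blocks = re.split(r"\n\s*\n", body.strip())
    # Lines were wrapped on whitespace only, so collapsing every run of
    # whitespace (including the wrap newlines) to one space restores the
    # paragraph without splitting or merging words.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Each returned string is a single unwrapped paragraph, which is usually more convenient for sentence splitting or tokenization than the 80-column layout.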
Data Format
The dataset is provided as a text file:
szabadeuropa.hu.txt
Inside the file, each article is delimited by a YAML Front Matter block containing metadata, followed by the full article text.
Metadata Fields
url: The canonical URL of the original article.
title: The headline of the article.
date: Publication date (ISO 8601 format: YYYY-MM-DD).
script: The writing system used (latn for Latin).
lang: The detected language code (hu for Hungarian).
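The layout described above can be read with a short parser. The sketch below is illustrative only and rests on assumptions the description does not confirm: that each front-matter block opens and closes with a line containing only "---", and that url is the first field in the block. It uses the standard library and splits each metadata line on its first colon, so it is not a full YAML parser; a title that uses YAML escapes or multi-line values would need a real YAML library. The name iter_articles is hypothetical.

```python
import re
from typing import Iterator

# Assumed delimiter layout: "---", then "key: value" lines beginning with
# "url:", then a closing "---". Anchoring on "url:" reduces the chance of
# mistaking a "---" line inside an article body for a delimiter.
FRONT_MATTER = re.compile(r"^---\n(url:.*?)\n---\n", re.DOTALL | re.MULTILINE)


def iter_articles(text: str) -> Iterator[dict]:
    """Yield one dict per article: the metadata fields plus a 'text' key."""
    matches = list(FRONT_MATTER.finditer(text))
    for i, match in enumerate(matches):
        meta = {}
        for line in match.group(1).splitlines():
            # Split on the first colon only, so URLs and titles that
            # contain colons keep their full value.
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip()] = value.strip().strip("'\"")
        # The body runs from the end of this block to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        meta["text"] = text[match.end():end].strip()
        yield meta
```

To use it, read szabadeuropa.hu.txt as UTF-8 and pass the contents to iter_articles; each yielded dict then carries url, title, date, script, lang, and text.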
Source & License
All content is the property of RFE/RL, Inc. and is protected by U.S. and international copyright laws.
Users of this dataset must adhere to the RFE/RL Terms of Use. Specifically, users must credit RFE/RL in any reuse:
Copyright (c) 2026 RFE/RL, Inc. Used with the permission of Radio Free Europe/Radio Liberty.