Kaler Kantho Bengali Newspaper Corpus

License:

CC-BY-NC-4.0

Steward:

Kaler Kantho

Task: NLP

Release Date: 4/13/2026

Format: DOCX

Size: 33.11 MB


Description

The Kaler Kantho Bengali Newspaper Corpus is a large-scale text dataset containing over 10 million tokens collected from the digital archives of Kaler Kantho, a major daily newspaper in Bangladesh. It represents modern Bengali journalistic writing across domains such as national politics, international affairs, social issues, and cultural content.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended for research, educational, and non-commercial use only.

Forbidden Usage

Commercial use of this dataset is prohibited. Use is limited to research and educational purposes.

Processes

Ethical Review

Data was ethically sourced from public journalistic archives for linguistic research purposes.

Intended Use

This dataset is intended for Natural Language Processing (NLP) research on the Bengali language.

Metadata

Language

Bengali (বাংলা), also known as Bangla, is an Indo-Aryan language spoken primarily in the Bengal region of South Asia. With over 242 million native speakers as of 2025, it ranks as the sixth most spoken native language in the world. It is the official and national language of Bangladesh and has official status in several Indian states, including West Bengal and Tripura. In 2024 the Government of India accorded Bengali the status of a classical language, recognizing its literary history of more than a millennium. The language was also at the center of the historic Bengali Language Movement.

Bengali Alphabet

অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য় ৎ ং ঃ ঁ
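
When filtering extracted text, it can help to measure how much of a string falls in the Unicode Bengali block (U+0980 to U+09FF), which contains the letters and signs listed above. The following is a minimal sketch; the helper names `is_bengali` and `bengali_ratio` are hypothetical and not part of the dataset.

```python
# Unicode Bengali block, U+0980..U+09FF (the upper bound of range() is exclusive).
BENGALI_BLOCK = range(0x0980, 0x0A00)


def is_bengali(ch: str) -> bool:
    """Return True if a single character lies in the Bengali block."""
    return ord(ch) in BENGALI_BLOCK


def bengali_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Bengali block.

    Note: the danda (U+0964) is encoded in the Devanagari block, so it is
    counted as non-Bengali by this simple check.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_bengali(ch) for ch in chars) / len(chars)
```

A low ratio on a paragraph may indicate embedded Latin-script text or leftover formatting debris; any cutoff value is a choice for the user to make.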

Domains of the Text

  • Literature (News reports)

  • Poetry (Aesthetic / cultural expression)

  • Folklore & Oral Tradition (Textual form)

  • Everyday Social Themes (As reflected in texts)

  • Cultural Knowledge & Heritage

  • News reports (National & International)

  • Articles (Aesthetic / cultural expression)

Dataset Structure

  • The dataset consists of 28 Microsoft Word (.docx) files.

  • Each file acts as a separate genre or domain container for the corpus.

  • Total token count: 10+ million.

File-Level Metadata

  • Kaler kontho_part_1-359000

  • Kaler kontho_part_2-370000

  • Kaler kontho_part_3-358000

  • Kaler kontho_part_4-360000

  • Kaler kontho_part_5-363000

  • Kaler kontho_part_6-370000

  • Kaler kontho_part_7-368000

  • Kaler kontho_part_8-360000

  • Kaler kontho_part_9-366000

  • Kaler kontho_part_10-368000

  • Kaler kontho_part_11-371000

  • Kaler kontho_part_12-366000

  • Kaler kontho_part_13-361000

  • Kaler kontho_part_14-359000

  • Kaler kontho_part_15-371000

  • Kaler kontho_part_16-365000

  • Kaler kontho_part_17-370000

  • Kaler kontho_part_18-367000

  • Kaler kontho_part_19-370000

  • Kaler kontho_part_20-371000

  • Kaler kontho_part_21-309000

  • Kaler kontho_part_22-369000

  • Kaler kontho_part_23-367000

  • Kaler kontho_part_24-370000

  • Kaler kontho_part_25-366000

  • Kaler kontho_part_26-365000

  • Kaler kontho_part_27-371000

  • Kaler kontho_part_28-369000
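
The entries above follow the pattern `Kaler kontho_part_<N>-<number>`. The sketch below parses that pattern and sums the numeric suffixes. It assumes the suffix is an approximate per-file token count, which this card does not state explicitly; the optional `.docx` extension and the helper name `parse_name` are also assumptions made for illustration.

```python
import re

# Numeric suffixes from the file list, in thousands, for parts 1 to 28.
SUFFIX_K = [359, 370, 358, 360, 363, 370, 368, 360, 366, 368,
            371, 366, 361, 359, 371, 365, 370, 367, 370, 371,
            309, 369, 367, 370, 366, 365, 371, 369]

# Rebuild the listed names, e.g. "Kaler kontho_part_1-359000".
FILE_NAMES = [f"Kaler kontho_part_{i}-{k * 1000}"
              for i, k in enumerate(SUFFIX_K, start=1)]

# The ".docx" extension is optional here because the list omits it (assumption).
NAME_RE = re.compile(r"^Kaler kontho_part_(\d+)-(\d+)(?:\.docx)?$")


def parse_name(name: str) -> tuple[int, int]:
    """Split a file name into (part number, numeric suffix)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


counts = dict(parse_name(name) for name in FILE_NAMES)
total = sum(counts.values())
print(len(counts), total)
```

Under that assumption, the 28 suffixes sum to 10,199,000, which agrees with the stated total of over 10 million tokens.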

Recommended Processing

  • File Format: The data is provided in Microsoft Word (.docx) format.

  • Normalization: Users are encouraged to apply Unicode normalization (for example, NFC) during data extraction so that visually identical text is encoded consistently.

  • Cleanup: Removing extra whitespace, unwanted punctuation, and stray formatting artifacts left over from the Word documents is recommended.

Sample Text

  • (ক) বাংলাদেশের জলবায়ু মোটামুটি সমভাবাপন্ন।

  • (খ) ভূমিকম্প একটি প্রাকৃতিক দুর্যোগ।

  • তিনি বলেন, তৃণমূল অর্থনীতিতে গতি সঞ্চার করতেই তাঁদের লক্ষ্য করে প্রধানমন্ত্রী শেখ হাসিনা হাজার কোটি টাকার প্রণোদনা দিয়েছেন। কিন্তু সেটির সঠিক বাস্তবায়ন না হওয়ায় আমি খুব একটা আশাবাদী নই।

  • ১৬। ১৯৭১ সালের কত তারিখে ঢাকার বিভিন্ন সামরিক অবস্থানের ওপর যৌথ বাহিনীর বিমান হামলা চলে?

  • ৮। নিচের অনুচ্ছেদটির যথাস্থানে বিরামচিহ্ন বসিয়ে উত্তরপত্রে লেখো। ৫ গ্রামের নাম আনন্দপুর মামার বাড়ি কথায় আছে মামার বাড়ি রসের হাঁড়ি আসলেই তাই