Kaler Kantho Bengali Newspaper Corpus
License: CC-BY-NC-4.0
Steward: Kaler Kontho
Task: NLP
Release Date: 4/13/2026
Format: DOCX
Size: 33.11 MB
Description
The Kaler Kantho Bengali Newspaper Corpus is a large-scale text dataset containing over 10 million tokens collected from the digital archives of Kaler Kantho, a major daily newspaper in Bangladesh. It represents modern Bengali journalistic writing across domains such as national politics, international affairs, social issues, and cultural content.
Specifics
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Considerations
Restrictions/Special Constraints
This dataset is intended for research, educational, and non-commercial use only.
Forbidden Usage
Commercial use of this dataset is prohibited.
Processes
Ethical Review
Data was ethically sourced from public journalistic archives for linguistic research purposes.
Intended Use
This dataset is intended for natural language processing (NLP) research and education on the Bengali language.
Metadata
Language
Bengali (বাংলা), also known as Bangla, is a classical Indo-Aryan language primarily spoken in the Bengal region of South Asia. With over 242 million native speakers as of 2025, it ranks as the sixth most spoken native language in the world. It is the official and national language of Bangladesh and holds official status in several Indian states, including West Bengal and Tripura. The Government of India officially accorded Bengali the status of a classical language in 2024, honoring its rich millennium-old literary history and its role in the historic Bengali Language Movement.
Bengali Alphabet
অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ল শ ষ স হ ড় ঢ় য় ৎ ং ঃ ঁ
Domains of the Text
Literature (News reports)
Poetry (Aesthetic / cultural expression)
Folklore & Oral Tradition (Textual form)
Everyday Social Themes (As reflected in texts)
Cultural Knowledge & Heritage
News reports (National & International)
Articles (Aesthetic / cultural expression)
Dataset Structure
The dataset consists of 28 Microsoft Word (.docx) files.
Each file acts as a separate genre or domain container for the corpus.
Total token count: 10+ million.
File-level metadata (file name, then token count)
Kaler kontho_part_1-359000
Kaler kontho_part_2-370000
Kaler kontho_part_3-358000
Kaler kontho_part_4-360000
Kaler kontho_part_5-363000
Kaler kontho_part_6-370000
Kaler kontho_part_7-368000
Kaler kontho_part_8-360000
Kaler kontho_part_9-366000
Kaler kontho_part_10-368000
Kaler kontho_part_11-371000
Kaler kontho_part_12-366000
Kaler kontho_part_13-361000
Kaler kontho_part_14-359000
Kaler kontho_part_15-371000
Kaler kontho_part_16-365000
Kaler kontho_part_17-370000
Kaler kontho_part_18-367000
Kaler kontho_part_19-370000
Kaler kontho_part_20-371000
Kaler kontho_part_21-309000
Kaler kontho_part_22-369000
Kaler kontho_part_23-367000
Kaler kontho_part_24-370000
Kaler kontho_part_25-366000
Kaler kontho_part_26-365000
Kaler kontho_part_27-371000
Kaler kontho_part_28-369000
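As a quick consistency check, the listing above can be parsed and summed. This is a minimal sketch, and it assumes that the number after the final hyphen in each entry is that file's token count, which the listing itself does not label. The names MANIFEST and parse_manifest are illustrative and not part of the dataset.

```python
import re

# Entries copied verbatim from the file-level metadata listing.
# Assumption: the trailing number is the token count of that file.
MANIFEST = """\
Kaler kontho_part_1-359000
Kaler kontho_part_2-370000
Kaler kontho_part_3-358000
Kaler kontho_part_4-360000
Kaler kontho_part_5-363000
Kaler kontho_part_6-370000
Kaler kontho_part_7-368000
Kaler kontho_part_8-360000
Kaler kontho_part_9-366000
Kaler kontho_part_10-368000
Kaler kontho_part_11-371000
Kaler kontho_part_12-366000
Kaler kontho_part_13-361000
Kaler kontho_part_14-359000
Kaler kontho_part_15-371000
Kaler kontho_part_16-365000
Kaler kontho_part_17-370000
Kaler kontho_part_18-367000
Kaler kontho_part_19-370000
Kaler kontho_part_20-371000
Kaler kontho_part_21-309000
Kaler kontho_part_22-369000
Kaler kontho_part_23-367000
Kaler kontho_part_24-370000
Kaler kontho_part_25-366000
Kaler kontho_part_26-365000
Kaler kontho_part_27-371000
Kaler kontho_part_28-369000
"""


def parse_manifest(text):
    """Return {file_stem: count} for every 'name-count' line in text."""
    counts = {}
    for line in text.splitlines():
        # Greedy match, so the split happens at the last hyphen on the line.
        match = re.fullmatch(r"(.+)-(\d+)", line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = parse_manifest(MANIFEST)
total = sum(counts.values())
print(f"{len(counts)} files, {total:,} in total")  # 28 files, 10,199,000 in total
```

Under that assumption the per-file figures add up to 10,199,000, which agrees with the stated total of 10+ million tokens.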
Recommended Processing
File Format: The data is provided in Microsoft Word (.docx) format.
Normalization: Users are encouraged to apply Unicode normalization during data extraction to ensure consistent rendering.
Cleanup: Removal of excess whitespace, punctuation, and stray formatting artifacts left over from the Word documents is recommended.
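The three recommendations above could be sketched as follows. This is one possible approach using only the Python standard library, not the steward's own pipeline. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The function names, the choice of NFC as the normalization form, and the optional punctuation stripping are assumptions made for illustration. Note that NFC stores ড়, ঢ় and য় as a base letter plus nukta, and that the zero-width joiner and non-joiner are deliberately kept because they affect how Bengali conjuncts are rendered.

```python
import re
import unicodedata
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return paragraphs


def clean_text(text, strip_punctuation=False):
    """Normalize to NFC, drop invisible debris, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Remove byte-order marks and zero-width spaces (ZWJ/ZWNJ are kept).
    text = text.replace("\ufeff", "").replace("\u200b", "")
    if strip_punctuation:
        # Unicode categories starting with "P" are punctuation; this
        # includes the danda "।" (U+0964) used as a Bengali full stop.
        text = "".join(
            " " if unicodedata.category(ch).startswith("P") else ch
            for ch in text
        )
    # Collapse runs of whitespace (including no-break spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical usage; the exact file name on disk is an assumption:
# for para in read_docx_paragraphs("Kaler kontho_part_1.docx"):
#     line = clean_text(para)
#     if line:
#         print(line)

print(clean_text("  (খ)   ভূমিকম্প একটি প্রাকৃতিক দুর্যোগ।  "))
```

Whether to strip punctuation depends on the downstream task, so it is off by default here. Sentence-level work usually needs the danda retained.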
Sample Text
(ক) বাংলাদেশের জলবায়ু মোটামুটি সমভাবাপন্ন।
(খ) ভূমিকম্প একটি প্রাকৃতিক দুর্যোগ।
তিনি বলেন, তৃণমূল অর্থনীতিতে গতি সঞ্চার করতেই তাঁদের লক্ষ্য করে প্রধানমন্ত্রী শেখ হাসিনা হাজার কোটি টাকার প্রণোদনা দিয়েছেন। কিন্তু সেটির সঠিক বাস্তবায়ন না হওয়ায় আমি খুব একটা আশাবাদী নই।
১৬। ১৯৭১ সালের কত তারিখে ঢাকার বিভিন্ন সামরিক অবস্থানের ওপর যৌথ বাহিনীর বিমান হামলা চলে?
৮। নিচের অনুচ্ছেদটির যথাস্থানে বিরামচিহ্ন বসিয়ে উত্তরপত্রে লেখো। ৫ গ্রামের নাম আনন্দপুর মামার বাড়ি কথায় আছে মামার বাড়ি রসের হাঁড়ি আসলেই তাই