License:
CC-BY-NC-SA-4.0
Steward:
MirasAI
Task: NLP
Release Date: 4/17/2026
Format: TXT
Size: 6.59 MB
The Chittagonian (চাটগাঁইয়া, saṭgãia) Text Corpus is a curated collection of approximately 690,000 tokens reflecting the distinct linguistic and cultural identity of the Greater Chittagong region in Bangladesh. This dataset features a large collection of drama scripts, a unique domain that captures folklore, everyday social themes, and traditional cultural expressions within a single narrative framework. The corpus includes both original .docx files and plain text .txt files. This dual-format structure gives researchers access to both the source material and standardized text for computational modeling and dialectological study.
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Restrictions/Special Constraints
This dataset is intended solely for research, scientific, and non-commercial purposes.
Forbidden Usage
Commercial use of this dataset is not permitted; it is for research and educational purposes only.
Ethical Review
We collected the data for each file after obtaining proper consent from the original authors through the coordinating publishing agency.
Intended Use
This dataset is intended for use in creating Natural Language Processing (NLP) tools and linguistic analysis for the Chittagonian language.
Chittagonian (চাটগাঁইয়া), also known as Chittagonian Bengali, is an Indo-Aryan language spoken primarily in the Greater Chittagong region of Bangladesh. With an estimated 13 million speakers, it is a member of the Bengali-Assamese sub-branch. While it shares cultural ties with Standard Bengali, it is linguistically distinct with unique phonetic and morphological properties that make it not inherently intelligible to speakers of other Bengali varieties. It shares significant mutual intelligibility with the Rohingya language.
অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ।
Literature (Drama): A unique domain containing cultural folklore and social expressions.
Poetry: Aesthetic and cultural expression.
Folklore & Oral Tradition: Textual forms of traditional heritage.
Everyday Social Themes: Contextual reflections of community life.
Cultural Knowledge & Heritage.
The dataset is organized into two main folders:
01-TXT Files (UTF-8-Converted)
02-Original Files (DOCX)
Each folder contains 15 files acting as separate genre/domain containers.
01-Dramas Script Collection-31000-T.txt
02-Dramas Script Collection-37500-T.txt
03-Dramas Script Collection-13500-T.txt
04-Dramas Script Collection-30500-T.txt
05-Dramas Script Collection-25000-T.txt
06-Dramas Script Collection-49000-T.txt
07-Dramas Script Collection-51000-T.txt
08-Dramas Script Collection-77000-T.txt
09-Folk Dramas Script Collection-80500-T.txt
10-Folk Dramas Script Collection-86000-T.txt
11-Dramas Script Collection-61000-T.txt
12-Dramas Script Collection-30000-T.txt
13-Dramas Script Collection-31000-T.txt
14-Dramas Script Collection-67000-T.txt
15-Dramas Script Collection-20000-T.txt
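The file names listed above appear to follow a fixed pattern: a two-digit index, a genre label, a number, and a "-T" suffix. The sketch below assumes that the number is the approximate token count of the file; the description does not state this explicitly, but the numbers add up to the roughly 690,000 tokens quoted for the corpus. The regular expression and the helper name parse_corpus_filename are illustrative and not part of the dataset.

```python
import re

# Assumed layout: "<2-digit index>-<genre label>-<approx. token count>-T.txt"
FILENAME_RE = re.compile(r"^(\d{2})-(.+)-(\d+)-T\.txt$")


def parse_corpus_filename(name):
    """Split a corpus file name into (index, genre label, approximate token count)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    index, label, tokens = match.groups()
    return int(index), label, int(tokens)


# File names copied from the listing above.
FILE_NAMES = [
    "01-Dramas Script Collection-31000-T.txt",
    "02-Dramas Script Collection-37500-T.txt",
    "03-Dramas Script Collection-13500-T.txt",
    "04-Dramas Script Collection-30500-T.txt",
    "05-Dramas Script Collection-25000-T.txt",
    "06-Dramas Script Collection-49000-T.txt",
    "07-Dramas Script Collection-51000-T.txt",
    "08-Dramas Script Collection-77000-T.txt",
    "09-Folk Dramas Script Collection-80500-T.txt",
    "10-Folk Dramas Script Collection-86000-T.txt",
    "11-Dramas Script Collection-61000-T.txt",
    "12-Dramas Script Collection-30000-T.txt",
    "13-Dramas Script Collection-31000-T.txt",
    "14-Dramas Script Collection-67000-T.txt",
    "15-Dramas Script Collection-20000-T.txt",
]

parsed = [parse_corpus_filename(name) for name in FILE_NAMES]
total_tokens = sum(tokens for _, _, tokens in parsed)
folk_tokens = sum(tokens for _, label, tokens in parsed if label.startswith("Folk"))

print(total_tokens)  # 690000
print(folk_tokens)   # 166500
```

Parsing the names rather than hard-coding the counts makes it easy to select a subset (for example, only the two folk drama files) when building train and evaluation splits.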
UTF-8 Conversion: Detailed conversion from .docx to UTF-8 Unicode.
Normalization: Unicode normalization for script consistency.
Cleanup: Removal of extra whitespace, punctuation, and stray symbols/markup.
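The normalization and cleanup steps above can be approximated with the Python standard library. This is a minimal sketch and not the stewards' actual pipeline: the normalization form (NFC), the definition of punctuation and symbols (Unicode general categories starting with P or S), and the choice to collapse whitespace runs into single spaces are all assumptions. The function name clean_text is hypothetical.

```python
import re
import unicodedata


def clean_text(text, form="NFC"):
    """Normalize Unicode, drop punctuation and symbols, collapse whitespace.

    Assumptions: NFC is the target form, and "punctuation/stray symbols"
    means any character whose Unicode category starts with P or S.
    Combining marks (vowel signs, candrabindu, virama) are left untouched.
    """
    text = unicodedata.normalize(form, text)
    pieces = []
    for ch in text:
        if unicodedata.category(ch)[0] in ("P", "S"):
            pieces.append(" ")  # replace with a space so adjacent words do not fuse
        else:
            pieces.append(ch)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", "".join(pieces)).strip()


print(clean_text("আঁই   ভাত, খাই।"))  # আঁই ভাত খাই
```

Replacing punctuation with a space before collapsing whitespace avoids joining words that were separated only by a comma or a danda (।). Format characters such as the zero-width joiner fall outside the P and S categories and are preserved, which matters for some Bengali-script conjuncts.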
Rezamahi Publishing Agency, Rajshahi, Bangladesh.
সাংবাদিক আপনার কি সমস্যা কি হইছে আপনি সাংবাদিক তুলে বন স্যার আপনার দুঃখের কথা
তুই যদি তারাবিতে বিলের মহন গরিব আমি বালাই আলু না মনে সংসার কে চলিব
এই রোডটা তো আপনাকে দেখানোই হয়েছে এই রোড দিয়ে আউটার রিং রোডে উঠে যাওয়া যায় কাতারের বাম পাশে আর সামনে যেটা দেখাচ্ছি
থেকে কি নিলে আতা হই এমনি মিছা কথা হই ওডা তুই আই বাইলা খাইলো বন্ধু তুই মিছা কথা কত আ
আপনাদেরকে আমি এখন অতি সেরকম অতি জাগা নাগা একটি ফাট একটা গান শোনাচ্ছি আপনাদের যদি ভালো না লাগে তাহলে আপনারা আমাদেরকে