License:
CC-BY-NC-4.0
Steward:
MirasAI
Task: NLP
Release Date: 4/21/2026
Format: TXT, DOCX
Size: 3.61 MB
The Noakhalian (নোয়াখাইল্লা) Text Corpus is a systematically curated linguistic resource comprising approximately 504,500 tokens of the Noakhalian language variety. The dataset is composed primarily of drama scripts, a domain that provides rich insight into the phonetic, morphological, and sociolinguistic nuances of the Greater Noakhali region. By capturing authentic dialogue, cultural folklore, and everyday social interactions, the corpus serves as a resource for dialectological studies and for the development of computational models for low-resource Indo-Aryan languages. The data is provided both as the original .docx source documents and as plain text files to support modern NLP workflows.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes.
Forbidden Usage
Users agree not to attempt to determine the identity of individuals mentioned in the text and are strictly prohibited from using this data for commercial purposes or to train deceptive AI systems.
Ethical Review
Every file in this corpus was acquired and compiled after obtaining explicit informed consent from the original authors via the Rezamahi Publishing Agency.
Intended Use
This dataset is intended for use in developing Natural Language Processing (NLP) tools and conducting linguistic analysis for the Noakhalian language variety.
Noakhalian (নোয়াখাইল্লা, also known as Noakhailla) is an Eastern Indo-Aryan language variety spoken in Greater Noakhali, South Tripura, and parts of the Chittagong Division. Historically, linguists such as Grierson categorized it within the Southeastern Bengali group, while others, such as Suniti Kumar Chatterji and Sukumar Sen, placed it under the Vanga group. It is recognized for phonetic and morphological properties that distinguish it from Standard Bengali and other eastern dialects.
অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ
Literature (Drama): A primary and unique domain encompassing cultural folklore, social dynamics, and news-style narratives within a dramatic framework.
Poetry: Aesthetic and cultural expression through verse.
Folklore & Oral Tradition: Written records of traditional stories and heritage.
Everyday Social Themes: Contextual reflections of community life.
Cultural Knowledge & Heritage.
The dataset is organized into two primary directories:
01-TXT Files (UTF-8-Converted): Standardized, machine-readable plain text.
02-Original Files (DOCX): Original source documents.
Each folder contains 6 files categorized by domain.
Rezamahi Publishing Agency, Rajshahi, Bangladesh.
01-Noakhali-dialect-drama-collec.txt
02-Noakhali-dialect-drama-collec.txt
03-Noakhali-dialect-drama-collec.txt
04-Noakhali-dialect-drama-collec.txt
05-Noakhali-dialect-drama-collec.txt
06-Noakhali-dialect-drama-collec.txt
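A minimal sketch of reading the plain-text files listed above, assuming the archive has been extracted locally with the directory name unchanged. The `corpus_root` argument and both function names are hypothetical. Whitespace splitting gives only a rough count: the tokenization behind the reported figure of approximately 504,500 tokens is not documented here, so results may differ.

```python
from pathlib import Path

# Directory name as given in the dataset description.
TXT_DIR_NAME = "01-TXT Files (UTF-8-Converted)"


def count_tokens(text: str) -> int:
    """Rough token count: split on any run of whitespace."""
    return len(text.split())


def corpus_token_counts(corpus_root: str) -> dict:
    """Map each .txt file name in the TXT directory to its whitespace token count."""
    txt_dir = Path(corpus_root) / TXT_DIR_NAME
    counts = {}
    for path in sorted(txt_dir.glob("*.txt")):
        # The files are described as UTF-8, so decode explicitly.
        counts[path.name] = count_tokens(path.read_text(encoding="utf-8"))
    return counts
```

Calling `sum(corpus_token_counts("path/to/corpus").values())` would then give a corpus-wide total under this simple tokenization.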
Detailed Conversion: Every file was meticulously converted from .docx to UTF-8 encoded text using a high-precision methodology.
Unicode Normalization: Standardized to ensure consistent rendering of characters and regional diacritics.
Refined Cleanup: Automated and manual removal of stray symbols, markup, and formatting artifacts.
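The normalization and cleanup steps above could be approximated as follows. This is a sketch under stated assumptions, not the pipeline actually used for the corpus: the source does not name the normalization form or the characters removed, so NFC and the small set of stripped characters below are illustrative choices, and `clean_line` is a hypothetical helper.

```python
import re
import unicodedata

# Assumed debris: zero-width space and byte-order mark. ZWJ/ZWNJ (U+200D, U+200C)
# are deliberately kept because they can affect how Bengali conjuncts render.
_INVISIBLE = re.compile("[\u200b\ufeff]")
# Runs of spaces, tabs, and no-break spaces.
_SPACES = re.compile(r"[ \t\u00a0]+")


def clean_line(line: str) -> str:
    """Apply NFC, drop assumed invisible debris, and collapse horizontal whitespace."""
    line = unicodedata.normalize("NFC", line)  # NFC is an assumption, see lead-in
    line = _INVISIBLE.sub("", line)
    return _SPACES.sub(" ", line).strip()
```

One detail worth knowing when comparing strings after NFC: the precomposed letters ড়, ঢ়, and য় are composition exclusions in Unicode, so NFC represents them as the base letter followed by the nukta sign rather than as a single code point.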
এরে সোহেলের বউ কোনাই গেলি দেখছ বান্দির ঘরের বান্দিরে ডাকতেছি বান্দির
থাকার ব্যবস্থা কই দিতে পারেন হয় আম্মা ও কিন্তু কথাটা খারাপ কয় নাই থাকেন সবাই মিলে মিশে থাকি দাঁচার রূপ খুশি ভাবি তো
নাটক এই আছিয়া তুই কও এই লড়াই কিতা লাইগা লাগাইছো তুইরা আমারে কও
কিতা চেক দিমু কিতা চেক দিবা বুঝতেছো না আই আমার বাপের বাড়ি যাইতাছি আমার বাপের বাড়ি থেকে যেসব কাপড় চোপড় আনছিলাম ওই ভরছি এছাড়া তোর
বাট আই লাভ ইউ জানে ম্যাম আমি তোমাকে লাভ করি। আমি তোমাকে কোনভাবে লস কইতাম পারতাম না। হাসছানি?