License:
CC-BY-NC-4.0
Steward:
MirasAI
Task: NLP
Release Date: 4/17/2026
Format: TXT
Size: 3.68 MB
The Rangpuri (অংপুরি Ôṅgpuri) Text Corpus is a professionally curated collection of approximately 501,500 tokens representing the linguistic and cultural heritage of the Rangpur Division in Bangladesh and neighboring regions. The dataset includes a diverse range of genres such as poetry, folklore, and drama scripts, bringing together everyday social themes, cultural expressions, and oral traditions in one resource. It is provided both as the original .docx source documents and as UTF-8 plain text files. This corpus is a valuable resource for researchers working on the Kamta group of languages and for the development of computational tools for low-resource Indo-Aryan varieties.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes and must not be used for commercial distribution.
Forbidden Usage
You agree not to attempt to identify any individuals represented in the text. Commercial redistribution of this data and its use to train harmful or deceptive models are strictly prohibited.
Ethical Review
Every file in this corpus was acquired and compiled after explicit informed consent was obtained from the original authors via the coordinating publishing agency.
Intended Use
This dataset is designed to facilitate Natural Language Processing (NLP) research and the development of linguistic tools for the Rangpuri language.
Rangpuri (অংপুরি or অমপুরি), also commonly known as Bahe, Deshi bhasha, or Anchalit bhasha, is an Eastern Indo-Aryan language of the Bengali-Assamese branch. It is spoken across the Rangpur Division in Bangladesh, northern West Bengal, and western Goalpara in Assam, India. According to Glottolog, it forms the Central-Eastern Kamta group alongside the Kamta language, Rajbanshi, and Surjapuri. Many speakers are bilingual in Bengali or Assamese depending on their respective regions.
অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ
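The letter list above covers independent vowels, consonants, and a few signs, but running text also contains dependent vowel signs and other marks that are not enumerated. A minimal sketch, assuming you want to see which characters in a passage fall outside the listed inventory; the function name `unlisted_characters` is hypothetical and not part of the dataset:

```python
import unicodedata
from collections import Counter

# Letter inventory copied from the list above.
INVENTORY = (
    "অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, "
    "ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, "
    "ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ"
)

# Normalize first so that precomposed and decomposed spellings compare equal,
# then collect every code point that is not a separator.
LISTED = {ch for ch in unicodedata.normalize("NFC", INVENTORY) if ch not in ", "}


def unlisted_characters(text: str) -> Counter:
    """Count non-whitespace characters in `text` that are not in the inventory."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(
        ch for ch in normalized if not ch.isspace() and ch not in LISTED
    )


if __name__ == "__main__":
    # Each reported character is a code point absent from the listed letters.
    for ch, count in unlisted_characters("অংপুর").items():
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), count)
```

Characters reported this way are not errors in themselves; vowel signs, punctuation, and digits are expected in the corpus even though the list does not name them.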
Literature (Drama): A primary and unique domain that encapsulates cultural folklore, social dynamics, and everyday linguistic expressions.
Poetry: Aesthetic and cultural expression through verse.
Folklore & Oral Tradition: Written records of traditional stories and heritage.
Everyday Social Themes: Contextual reflections of community life and news-style narratives within scripts.
Cultural Knowledge & Heritage.
The dataset is organized into two primary directories:
01-TXT Files (UTF-8-Converted): Machine-readable plain text files.
02-Original Files (DOCX): Source documents in Microsoft Word format.
01-Drama Script Collection - 15500 T.txt
02-Drama Script Collection - 44000 T.txt
03-Drama Script Collection - 42500 T.txt
04-Drama Script Collection - 134500 T.txt
05-Drama Script Collection - 75000 T.txt
06-Drama Script Collection - 100500 T.txt
07-Drama Script Collection - 89500 T.txt
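Each file name above ends in a number followed by "T", which appears to be that file's token count. A minimal sketch, under that assumption, that reads the declared counts from the names and sums them for comparison with the corpus-level figure of approximately 501,500 tokens; the helper name `declared_tokens` is hypothetical:

```python
import re

# File names as listed above.
FILE_NAMES = [
    "01-Drama Script Collection - 15500 T.txt",
    "02-Drama Script Collection - 44000 T.txt",
    "03-Drama Script Collection - 42500 T.txt",
    "04-Drama Script Collection - 134500 T.txt",
    "05-Drama Script Collection - 75000 T.txt",
    "06-Drama Script Collection - 100500 T.txt",
    "07-Drama Script Collection - 89500 T.txt",
]

# Assumption: the trailing "- <digits> T.txt" encodes the token count.
TOKEN_SUFFIX = re.compile(r"-\s*(\d+)\s*T\.txt$")


def declared_tokens(file_name: str) -> int:
    """Return the token count declared in a corpus file name."""
    match = TOKEN_SUFFIX.search(file_name)
    if match is None:
        raise ValueError(f"no token count found in {file_name!r}")
    return int(match.group(1))


total = sum(declared_tokens(name) for name in FILE_NAMES)

if __name__ == "__main__":
    print(total)  # 501500
```

The declared counts depend on how the curators tokenized the text, so a different tokenizer may give different per-file numbers.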
Detailed Conversion: Every file was converted from .docx to UTF-8 encoded text with a high level of precision.
Unicode Normalization: Standardized to ensure consistent rendering of characters and diacritics.
Refined Cleanup: Automated and manual removal of stray symbols, markup, and formatting artifacts.
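The steps above do not state which Unicode normalization form was applied. A minimal sketch, assuming NFC and a simple whitespace cleanup, for text you want to compare against or merge with the corpus; `normalize_line` is a hypothetical helper and not the curators' actual pipeline:

```python
import re
import unicodedata


def normalize_line(line: str) -> str:
    """Apply NFC normalization, then collapse runs of spaces and tabs."""
    # Assumption: NFC. It maps canonically equivalent sequences (for example
    # a two-part vowel sign typed as two code points) to one consistent form.
    text = unicodedata.normalize("NFC", line)
    # Collapse spaces, tabs, and no-break spaces into single spaces.
    # Zero-width joiners are left alone because they can affect how
    # Bengali-script conjuncts render.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "  তাইলে   আমরাগোও \t একদিন  "
    print(normalize_line(sample))
```

If the corpus was normalized to a different form, replace "NFC" with that form before comparing strings.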
Rezamahi Publishing Agency, Rajshahi, Bangladesh.
তাইলে আমরাগোও একদিন এলগাড়ি চড়িয়া অংপুর যাই হুম আচ্ছা আচ্ছা টপ
উজির গীদাল : শুন বাবা যে ধল্লায় তোর বাড়িঘর ভাঙ্গি বালুর চর ফেলাইচে সেই ধললায় একদিন সুজলা সুফলা শস্য শ্যামলার বান ডাকি আইনবে। |
হ্যাঁ সমস্যা কি বল সোমা ভিজিট বলো স্যারের ভিজিট হচ্ছে টাকা আর ম্যাডামের
যকন কী ইয়্যা হইছিল, ওই যে সুমন আর হইলো আলম জামাই, আর আমরা তো মনে কর তিনজন ছিলাম একসাতে যেমন, সিরাজ, আমি, তুই তিনজন ছিলাম, পাশাপাশি আমার সাতে
আমের গাচে যায় রে কাউয়া পাকা আম খায় ছোট ছোট চ্যাংড়া চেংড়ি তাক দ্যায় রে। (হায় হায় দারুণ বিদি