License:
CC-BY-NC-4.0
Steward:
MirasAI
Task: NLP
Release Date: 4/21/2026
Format: TXT
Size: 3.53 MB
The Sylheti Text Corpus is a professionally assembled linguistic dataset comprising approximately 524,000 tokens of the Sylheti language. This collection focuses on drama scripts, which preserve cultural folklore, social nuances, and everyday idiomatic expressions. Sylheti is an Eastern Indo-Aryan language with distinct phonology and is considered vulnerable by UNESCO, making this corpus useful for both language preservation and computational research. The dataset is available in original .docx source documents and plain text files, supporting tasks such as dialectological analysis and low-resource language modeling.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes.
Forbidden Usage
Users agree not to attempt to determine the identity of individuals within the text and are strictly prohibited from using this data for commercial purposes or the training of harmful generative models.
Ethical Review
Every individual file in this corpus was acquired and compiled following the procurement of explicit informed consent from the original authors via Haque Publishing Agency.
Intended Use
This dataset is intended for Natural Language Processing (NLP) research and the development of computational linguistic tools for the Sylheti language.
Sylheti (ISO 639-3: syl) is an Eastern Indo-Aryan language spoken by approximately 11 million people worldwide, primarily in the Sylhet Division of Bangladesh and the Barak Valley of Assam, India. It is linguistically distinct from Standard Bengali due to significant differences in grammar and its rare tonal nature. While historically written in the Syloti Nagri script (an abugida consisting of 33 symbols), modern usage frequently employs the Bengali script.
Bengali Script (Used in this dataset): অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ
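Because the corpus text is written in the Bengali script, one quick sanity check is to flag letters that fall outside the Bengali Unicode block (U+0980 to U+09FF). The sketch below is only an illustration and not part of the dataset's tooling; the function name `non_bengali_letters` is hypothetical. It would also flag legitimately code-switched Latin words, so hits need manual review.

```python
def non_bengali_letters(text: str) -> list[tuple[int, str]]:
    """Return (index, char) for alphabetic characters outside U+0980..U+09FF.

    Combining vowel signs, digits, punctuation, and whitespace are not
    alphabetic according to str.isalpha(), so they are never reported.
    """
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if ch.isalpha() and not ("\u0980" <= ch <= "\u09ff")
    ]


if __name__ == "__main__":
    # A Bengali consonant followed by a look-alike Cyrillic letter.
    print(non_bengali_letters("\u0995\u043a"))
```

An empty result means every letter in the string belongs to the Bengali block.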
Literature (Drama): A primary and unique domain encompassing cultural folklore, social dynamics, and news-style narratives within a dramatic framework.
Poetry: Aesthetic and cultural expression.
Folklore & Oral Tradition: Written records of traditional stories and heritage.
Everyday Social Themes: Contextual reflections of community life.
Cultural Knowledge & Heritage.
Haque Publishing Agency, Rajshahi, Bangladesh.
The dataset is organized into two primary directories:
01-TXT Files (UTF-8-Converted): Standardized machine-readable plain text.
02-Original Files (DOCX): Original source documents.
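Each folder contains 4 files categorized by domain. The number in each file name appears to give that file's approximate token count (T); the four counts sum to 524,000, matching the corpus total stated above.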
01-Syheti-drama-collection - 352000 T.txt
02- Syheti-drama-collection - 63000 T.txt
03- Syheti-drama-collection - 71000 T.txt
04- Syheti-drama-collection - 38000 T.txt
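As a rough way to check the token counts in the file names, the plain text files can be read as UTF-8 and split on whitespace. This is a minimal sketch: whitespace tokenization is an assumption (the card does not say how tokens were counted), the helper names are hypothetical, and the directory path must be adjusted to wherever the corpus is stored locally.

```python
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (an assumed tokenization)."""
    return len(text.split())


def corpus_token_counts(txt_dir: str) -> dict[str, int]:
    """Map each .txt file name in txt_dir to its whitespace token count."""
    counts = {}
    for path in sorted(Path(txt_dir).glob("*.txt")):
        # The card states that the TXT files are UTF-8 encoded.
        text = path.read_text(encoding="utf-8")
        counts[path.name] = count_tokens(text)
    return counts


if __name__ == "__main__":
    # Hypothetical local path; point this at the downloaded TXT directory.
    counts = corpus_token_counts("01-TXT Files (UTF-8-Converted)")
    for name, n in counts.items():
        print(f"{name}: {n} tokens")
    print("total:", sum(counts.values()))
```

Counts from this sketch may differ from the stated figures if the original tally used a different tokenizer or treated punctuation separately.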
Detailed Conversion: Every file was meticulously converted from .docx to UTF-8 encoded text.
Unicode Normalization: Standardized to ensure consistent rendering of characters and tone-related diacritics.
Refined Cleanup: Automated and manual removal of stray symbols, markup, and formatting artifacts.
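The normalization and cleanup steps above could look roughly like the following sketch. It is not the pipeline actually used for this corpus: the choice of NFC, the list of stray characters, and the function name `normalize_line` are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed list of invisible characters that often survive document export:
# zero-width space, byte order mark, soft hyphen.
# ZWJ/ZWNJ (U+200D/U+200C) are deliberately kept, since Bengali-script
# text can use them to control conjunct rendering.
_STRAY = re.compile("[\u200b\ufeff\u00ad]")
# Runs of spaces, tabs, and no-break spaces.
_SPACES = re.compile(r"[ \t\u00a0]+")


def normalize_line(line: str) -> str:
    """Apply NFC, drop stray invisible characters, and collapse spacing."""
    line = unicodedata.normalize("NFC", line)
    line = _STRAY.sub("", line)
    line = _SPACES.sub(" ", line)
    return line.strip()


if __name__ == "__main__":
    raw = "\ufeff\u0995\u09c7\u09be \u00a0 \u0996\u200b "
    print(repr(normalize_line(raw)))
```

One detail worth knowing when choosing a form: under NFC the two-part vowel sign sequence U+09C7 U+09BE composes to U+09CB, while the precomposed letters ড়, ঢ়, and য় (U+09DC, U+09DD, U+09DF) are composition exclusions and come out as base letter plus nukta. Any downstream character inventory should account for that.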
সন্দেহ করবা সবরে হুম সবর ঘরে ঘরে গ চেক করবা যদি গাই তখন আমার ঘরে অত টিকা বুঝানিতে হইব আল্লাহ আমার হাত ঠান্ডা হই গেছে
তুমি বুঝ তুমি মাসে মাত্র আমি লগে মাত্র কত পাড়া আমি মাঝে মাত্রা না আমি গাড়ির মালিক আমি ড্রাইভার আছে ড্রাইভার আই গাড়ি চালাই
তিন রুজ গুজরিল না পাই দেখিতে ॥জুদাইর আনল আর না পারি শহিতে *পেরেমের তির লাগিআছে কলিজাএ ॥নএআন খুলিআ দেখ ডাকে ফাতিমাএ *আরজ করিলা জত নবিজির গুছর ॥না চাইলা হজরতে না দিলা উত্তর %উম্মতের পেরেম জিগরে নবিজির ॥কি করে উম্মতের গতি এলাহি কাদির *||৪
একদিন নদীর কান্দাত একগু গাছ-ও কাঠ কাটার সময় তার কুড়াল ওগু নদীর পানিত পড়ি গেছিল
ফুড়িটায় সাপ্তাখানেক আগে তাইর লগে ঘটি যাওয়া ঘটনা আমরার লগে শেয়ার করছিল