Sylheti Text Corpus by Haque Publishers

Description

The Sylheti Text Corpus is a professionally assembled linguistic dataset of approximately 524,000 tokens of Sylheti text. The collection focuses on drama scripts, which preserve cultural folklore, social nuances, and everyday idiomatic expressions. Sylheti is an Eastern Indo-Aryan language with distinct phonology and is considered vulnerable by UNESCO, which makes this corpus useful for both language preservation and computational research. The dataset is provided both as the original .docx source documents and as UTF-8 plain text files, and it supports tasks such as dialectological analysis and low-resource language modeling.

Specifics

Licensing

Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)

https://spdx.org/licenses/CC-BY-NC-4.0.html

Considerations

Restrictions/Special Constraints

This dataset is intended exclusively for non-commercial research, academic inquiry, and scientific purposes.

Forbidden Usage

Language

Sylheti (ISO 639-3: syl) is an Eastern Indo-Aryan language spoken by approximately 11 million people worldwide, primarily in the Sylhet Division of Bangladesh and the Barak Valley of Assam, India. It is linguistically distinct from Standard Bengali due to significant differences in grammar and its rare tonal nature. While historically written in the Syloti Nagri script (an abugida consisting of 33 symbols), modern usage frequently employs the Bengali script.

Script

Bengali Script (Used in this dataset): অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ, ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়, ৎ, ং, ঃ, ঁ
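
Because the text files use Bengali script, one simple sanity check is to look for letters or combining marks that fall outside the Unicode Bengali block (U+0980 to U+09FF). The sketch below is illustrative only and is not part of the publisher's pipeline; the function name non_bengali_chars is hypothetical, and treating the whole Bengali block as valid is an assumption.

```python
import unicodedata

# Unicode Bengali block boundaries (inclusive).
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def non_bengali_chars(text):
    """Return the set of letters/marks in `text` that lie outside the Bengali block.

    Whitespace, punctuation, digits, and symbols are ignored, so only
    characters from other scripts (Latin, Cyrillic, ...) are reported.
    """
    found = set()
    for ch in text:
        category = unicodedata.category(ch)
        is_letter_or_mark = category[0] in ("L", "M")
        in_bengali_block = BENGALI_START <= ord(ch) <= BENGALI_END
        if is_letter_or_mark and not in_bengali_block:
            found.add(ch)
    return found


# Example: a line that mixes Latin and Bengali script.
print(non_bengali_chars("Sylheti সিলেটি"))
```

A check like this can surface look-alike characters from other scripts that are hard to spot by eye.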

Domains of the Text

Literature (Drama): A primary and unique domain encompassing cultural folklore, social dynamics, and news-style narratives within a dramatic framework.
Poetry: Aesthetic and cultural expression.
Folklore & Oral Tradition: Written records of traditional stories and heritage.
Everyday Social Themes: Contextual reflections of community life.
Cultural Knowledge & Heritage.

Publisher

Haque Publishing Agency, Rajshahi, Bangladesh.

Dataset Structure

The dataset is organized into two primary directories:

01-TXT Files (UTF-8-Converted): Standardized machine-readable plain text.
02-Original Files (DOCX): Original source documents.
Each folder contains 4 files categorized by domain.

File-Level Metadata

01-Syheti-drama-collection - 352000 T.txt
02- Syheti-drama-collection - 63000 T.txt
03- Syheti-drama-collection - 71000 T.txt
04- Syheti-drama-collection - 38000 T.txt
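
Each file name appears to end with an approximate token count followed by "T". Assuming that reading is correct, the counts can be parsed from the names and summed to cross-check the corpus size given in the description. This is a minimal sketch; the helper name token_count and the regular expression are illustrative assumptions.

```python
import re

# File names exactly as listed above.
FILE_NAMES = [
    "01-Syheti-drama-collection - 352000 T.txt",
    "02- Syheti-drama-collection - 63000 T.txt",
    "03- Syheti-drama-collection - 71000 T.txt",
    "04- Syheti-drama-collection - 38000 T.txt",
]

# Assumption: the number before the trailing "T.txt" is the token count.
TOKEN_COUNT = re.compile(r"-\s*(\d+)\s*T\.txt$")


def token_count(file_name):
    """Extract the token count embedded at the end of a file name."""
    match = TOKEN_COUNT.search(file_name)
    if match is None:
        raise ValueError(f"no token count found in {file_name!r}")
    return int(match.group(1))


total = sum(token_count(name) for name in FILE_NAMES)
print(total)  # 524000
```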

Cleaning and Processing

Detailed Conversion: Every file was meticulously converted from .docx to UTF-8 encoded text.
Unicode Normalization: Standardized to ensure consistent rendering of characters and tone-related diacritics.
Refined Cleanup: Automated and manual removal of stray symbols, markup, and formatting artifacts.
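
The exact normalization form and cleanup rules are not documented here. As a rough sketch of what such a step could look like, the example below applies NFC normalization with Python's standard unicodedata module and removes a small, hypothetical set of stray symbols. The choice of NFC, the symbol set in STRAY, and the function name clean_line are all assumptions for illustration.

```python
import re
import unicodedata

# Hypothetical set of stray symbols; the real cleanup rules are not documented.
STRAY = re.compile(r"[*%|]+")
SPACES = re.compile(r"[ \t]+")


def clean_line(line):
    """Normalize a line to NFC, drop stray symbols, and collapse spaces."""
    line = unicodedata.normalize("NFC", line)  # assumption: NFC is the target form
    line = STRAY.sub(" ", line)                # replace stray symbols with a space
    line = SPACES.sub(" ", line)               # collapse runs of spaces and tabs
    return line.strip()


# Example: the two-part vowel sign E + AA is composed into the single sign O.
decomposed = "\u0995\u09c7\u09be"
print(clean_line(decomposed) == "\u0995\u09cb")
```

Normalizing to a single form means that visually identical strings compare as equal, which matters for tokenization and frequency counts.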

Sample Text

সন্দেহ করবা সবরে হুম সবর ঘরে ঘরে গ চেক করবা যদি গাই তখন আমার ঘরে অত টিকা বুঝানিতে হইব আল্লাহ আমার হাত ঠান্ডা হই গেছে
তুমি বুঝ তুমি মাসে মাত্র আমি লগে মাত্র কত পাড়া আমি মাঝে মাত্রা না আমি গাড়ির মালিক আমি ড্রাইভার আছে ড্রাইভার আই গাড়ি চালাই
তিন রুজ গুজরিল না পাই দেখিতে ॥জুদাইর আনল আর না পারি শহিতে *পেরেমের তির লাগিআছে কলিজাএ ॥নএআন খুলিআ দেখ ডাকে ফাতিমাএ *আরজ করিলা জত নবিজির গুছর ॥না চাইলা হজরতে না দিলা উত্তর %উম্মতের পেরেম জিগরে নবিজির ॥কি করে উম্মতের গতি এলাহি কাদির *||৪
একদিন নদীর কান্দাত একগু গাছ-ও কাঠ কাটার সময় তার কুড়াল ওগু নদীর পানিত পড়ি গেছিল
ফুড়িটায় সাপ্তাখানেক আগে তাইর লগে ঘটি যাওয়া ঘটনা আমরার লগে শেয়ার করছিল