License:
CC-BY-NC-4.0
Steward:
Community
Dataset ID:
cmr29k8m002tvmk07chd382j5
Task: NLP
Release Date: 7/1/2026
Format: JSON
Size: 1.13 MB
This dataset treats Sindhi Word Segmentation (SWS) as a sequence labeling task by assigning character-level tags (B, I, E, S, X) to unlabeled Sindhi text for word boundary detection. It is designed for training and evaluating sequence labeling models, including CRF, LSTM, and transformer-based architectures.
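As a rough illustration of the character-level tagging idea, the sketch below turns a list of words into a sentence plus one tag per character. The tag meanings are an assumption, since this card does not define them: `I` for word-internal characters, `S` for a single-character word, `X` for the separating space, and `B`/`E` for the two edges of a multi-character word. The sample records in this card show `E` on the first character of a word and `B` on the last, so the defaults follow that observed order; the function name `encode_tags` and its parameters are hypothetical.

```python
def encode_tags(words, first="E", last="B"):
    """Join words with single spaces and emit one tag per character.

    Assumed scheme (not documented in the card): `first`/`last` mark the
    edges of a multi-character word, "I" marks inner characters, "S" marks
    a one-character word, and "X" marks the space between words.
    """
    words = [w for w in words if w]  # ignore empty strings
    labels = []
    for index, word in enumerate(words):
        if index > 0:
            labels.append("X")  # tag for the joining space
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend([first] + ["I"] * (len(word) - 2) + [last])
    return " ".join(words), labels


# Placeholder Latin words keep the example independent of Unicode details.
sentence, labels = encode_tags(["abc", "d", "ef"])
print(sentence, labels)
```

Passing `first="B", last="E"` gives the more conventional ordering, should the files turn out to use it.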
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is provided for research and educational purposes only. Commercial use requires prior permission from the dataset creators, and proper attribution must be provided in all derived works.
Forbidden Usage
This dataset must not be used for any commercial, harmful, or unethical purposes, including generating offensive or discriminatory content or misrepresenting cultural expressions. Any use that violates applicable laws, privacy, or cultural sensitivity is strictly prohibited.
Ethical Review
This dataset was created for research and educational purposes using publicly available Sindhi text. It does not contain personally identifiable or sensitive information. Users are expected to use the dataset responsibly, in accordance with ethical guidelines and applicable laws. Commercial use requires prior permission from the dataset creators.
Intended Use
This dataset is intended for research and educational use in Sindhi Word Segmentation. It supports the training and evaluation of sequence labeling models, including CRF, LSTM, BiLSTM, and transformer-based architectures for automatic word boundary detection in Sindhi text.
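For feature-based models such as a CRF, each character is usually described by a small feature dictionary built from its neighbours. The following is a minimal sketch of that idea, not the feature set used by the dataset creators; the function names, the window size, and the `<BOS>`/`<EOS>` padding markers are all assumptions made for illustration.

```python
def char_features(sentence, i, window=2):
    """Describe the character at position i using a symmetric context window."""
    ch = sentence[i]
    feats = {
        "bias": 1.0,
        "char": ch,
        "is_space": ch.isspace(),
        "is_digit": ch.isdigit(),
    }
    for offset in range(1, window + 1):
        left, right = i - offset, i + offset
        # Pad with hypothetical boundary markers outside the sentence.
        feats[f"char-{offset}"] = sentence[left] if left >= 0 else "<BOS>"
        feats[f"char+{offset}"] = sentence[right] if right < len(sentence) else "<EOS>"
    return feats


def sentence_features(sentence, window=2):
    """Return one feature dictionary per character of the sentence."""
    return [char_features(sentence, i, window) for i in range(len(sentence))]


features = sentence_features("ab c")
print(features[0])
```

A list of such dictionaries per sentence, paired with the record's label list, is the kind of input that CRF toolkits such as sklearn-crfsuite accept.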
Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. Despite its rich literary heritage, it remains a low-resource language in NLP, particularly for word segmentation and sequence labeling tasks.
Perso-Arabic Script (Sindhi)
ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
Sindhi-Word-Segmentation/
│
├── labelled_dataset.json
├── labelled_sentences.txt
├── sd_seqlabelling.txt
└── README.md
| Field | Details |
|---|---|
| Dataset Name | Sindhi Word Segmentation Dataset (SdSEG) |
| Language | Sindhi (سنڌي) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-1 / 639-3 | sd / snd |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Domain | Natural Language Processing |
| Task Type | Sequence Labeling / Word Segmentation |
| Encoding | UTF-8 |
| Format | Sentence + Label Sequence |
{
"sentence": "متاثر علائقن ۾ رينجرز مقرر ڪرڻ جي گهر، آبادگارن کي تباهه ڪري ڇڏيو اٿئون:",
"labels": ["E", "I", "I", "I", "B", "X", "E", "I", "I", "I", "I", "B", "X", "S", "..."]
}
{
"sentence": "فهميده مرزا. فنڊز ۾ گهوٻيون، واهن۽شاخن جي کاٽي نه ٿيڻ ڪري بدين، جوهي ۽ ميرپور خاص۾پاڻي کوٽ آهي.",
"labels": ["E", "I", "I", "I", "I", "B", "X", "E", "I", "I", "B", "S", "X", "E", "I", "..."]
}
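Records of this shape can be turned back into words by walking the sentence and its labels in parallel. The sketch below assumes one label per character (the label lists above are truncated with `"..."`, so this is not verified here) and follows the order seen in the samples, where a multi-character word closes on `B`. Whether that order reflects the file contents or the display of right-to-left text is not stated, so the closing tag is a parameter; `decode_words` is a hypothetical helper name.

```python
def decode_words(sentence, labels, last="B"):
    """Split a sentence into words using its character-level labels.

    Assumptions: one label per character; "X" marks a separator that is
    dropped; "S" is a one-character word; `last` closes a longer word.
    """
    if len(sentence) != len(labels):
        raise ValueError(
            f"{len(sentence)} characters but {len(labels)} labels"
        )
    words, current = [], []
    for ch, tag in zip(sentence, labels):
        if tag == "X":
            if current:  # flush an unfinished word before the separator
                words.append("".join(current))
                current = []
            continue
        current.append(ch)
        if tag == "S" or tag == last:
            words.append("".join(current))
            current = []
    if current:  # trailing characters without a closing tag
        words.append("".join(current))
    return words


# Placeholder Latin text; the second call has no spaces between words.
print(decode_words("abc d ef", ["E", "I", "B", "X", "S", "X", "E", "B"]))
print(decode_words("abcdef", ["E", "I", "B", "S", "E", "B"]))
```

The length check is also a cheap sanity test to run over every record after loading `labelled_dataset.json` with the standard `json` module.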