License: CC-BY-NC-4.0
Steward: Community
Dataset ID: cmqnyiefo03r2nr07t906b7fg
Task: NLP
Release Date: 6/21/2026
Format: TXT
Size: 874.04 KB
The SiPOS dataset is a benchmark resource for Part-of-Speech (POS) tagging in Sindhi, a low-resource language. It contains over 293,000 tokens annotated with sixteen universal POS categories. Two experienced native-speaker annotators labeled the data manually with the Doccano annotation tool and reached an inter-annotator agreement of 0.872, which indicates consistent labels. SiPOS is designed to support research on the syntactic and morphological analysis of Sindhi and the development of robust sequence labeling models for low-resource NLP.
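The card does not say which statistic the 0.872 agreement figure refers to. Assuming it is Cohen's kappa (a common choice for two annotators, but an assumption here), the sketch below shows how such a score is computed from two parallel tag sequences. The function name `cohen_kappa` and the toy labels are illustrative only and are not taken from the dataset.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: share of tokens where both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal tag frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels, not SiPOS data).
annotator_1 = ["NOUN", "VERB", "ADP", "NOUN"]
annotator_2 = ["NOUN", "VERB", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the plain fraction of matching tags whenever a few tags dominate.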
Licensing
Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)
Restrictions/Special Constraints
This dataset is provided for research and educational purposes only. Users must ensure ethical and legal compliance when using the dataset. Proper attribution is required in all derived works.
Forbidden Usage
This dataset must not be used for commercial deployment, surveillance systems, harmful content generation, or any activity that violates privacy, laws, or ethical guidelines.
Intended Use
SiPOS is intended for research in Part-of-Speech tagging, syntactic parsing, morphological analysis, and sequence labeling for Sindhi NLP systems.
Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. Despite its rich linguistic tradition, it remains a low-resource language in NLP, especially for syntactic and morphological annotation tasks.
Perso-Arabic Script (Sindhi): ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
SiPOS/
│
└── SiPOS tagset
| Field | Details |
|---|---|
| Dataset Name | SiPOS |
| Language | Sindhi (سنڌي) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-1 / 639-3 | sd / snd |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Annotation Tool | Doccano |
| Encoding | UTF-8 |
| Format | CoNLL / token-level annotation |
اسڪول NNP PROPN اسم خاص
تي ADP ADP حرفِ جر
ٿيل VB VERB فعل
حملي NN NOUN اسم
کانپوءِ ADP ADP حرفِ جر
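The sample rows above appear to hold four whitespace-separated fields per token: the word, a fine-grained tag, a universal POS tag, and the Sindhi name of the tag, which can itself contain a space. The following is a minimal sketch of reading such rows under that assumption. The names `TaggedToken`, `parse_line`, and `parse_rows` are hypothetical helpers, not part of the dataset, and the actual file layout may differ.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class TaggedToken(NamedTuple):
    token: str  # surface word in Perso-Arabic script
    tag: str    # fine-grained tag, e.g. NNP
    upos: str   # universal POS tag, e.g. PROPN
    gloss: str  # Sindhi name of the tag; may contain spaces


def parse_line(line: str) -> TaggedToken:
    # Split on the first three runs of whitespace only, so a multi-word
    # Sindhi tag name stays together in the last field.
    parts = line.strip().split(maxsplit=3)
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(parts)}: {line!r}")
    gloss = parts[3] if len(parts) == 4 else ""
    return TaggedToken(parts[0], parts[1], parts[2], gloss)


def parse_rows(lines: Iterable[str]) -> List[TaggedToken]:
    # Blank lines are skipped (assumption: they only separate sentences).
    return [parse_line(line) for line in lines if line.strip()]


# The five rows shown in the card, copied as-is.
SAMPLE = """\
اسڪول NNP PROPN اسم خاص
تي ADP ADP حرفِ جر
ٿيل VB VERB فعل
حملي NN NOUN اسم
کانپوءِ ADP ADP حرفِ جر
"""

rows = parse_rows(SAMPLE.splitlines())
upos_counts = Counter(row.upos for row in rows)
print(upos_counts)
```

Limiting the split to three separators is the main design choice here: a plain `split()` would break two-word tag names such as the one on the first row into separate fields.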