License: CC-BY-SA-4.0
Steward: Community
Dataset ID: cmqlfovw202l7mm07fiwxwa9e
Task: NLP
Release Date: 6/19/2026
Format: TXT
Size: 2.51 MB
This dataset is the Sindhi Named Entity Recognition (SiNER) corpus, developed to support research and development in Natural Language Processing (NLP) and language technology for the Sindhi language. The corpus contains Sindhi text annotated with named entities, collected from publicly available written sources. It offers linguistic variety in vocabulary, sentence structure, and writing style while capturing the real-world contexts in which named entities appear. The dataset is intended for tasks such as named entity recognition, information extraction, question answering, machine translation, and other NLP applications involving Sindhi. It can also support language modeling, linguistic analysis, educational purposes, and the development of low-resource language technologies.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
This dataset is released for research and educational purposes. Users must obtain permission before any commercial use. Any use that infringes on privacy, violates applicable laws, or causes harm to individuals or communities is prohibited. Appropriate attribution is required.
Forbidden Usage
Any use of this dataset must respect ethical AI principles and applicable data protection requirements.
Intended Use
This dataset is intended for research, education, and the development of Sindhi natural language processing applications.
Sindhi (سنڌي) is an Indo-Aryan language of the Indo-European language family, belonging to the Northwestern Indo-Aryan branch. It is the official language of the Sindh province in Pakistan and one of the 22 scheduled languages of India, where it is spoken across Gujarat, Rajasthan, Maharashtra, and Madhya Pradesh. It is also spoken among diaspora communities in the Gulf region, the United Kingdom, and North America. According to Glottolog, it belongs to the Sindhi–Lahnda group of the Northwestern Indo-Aryan zone. Sindhi has a rich literary tradition spanning over a thousand years, with major contributions in Sufi devotional poetry, prose, and folk literature, most famously the verse of Shah Abdul Latif Bhittai. In Pakistan, Sindhi is written in a Perso-Arabic script; in India, it may be written in either Perso-Arabic or Devanagari. Many speakers are bilingual in Urdu, Hindi, or English, depending on their region and level of education.
Perso-Arabic Script (Sindhi) ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
SiNer - Sindhi Named Entity Recognition/
│
└── SiNER-dataset
| Field | Details |
|---|---|
| Dataset Name | Sindhi Named Entity Recognition Dataset (SiNER) |
| Language | Sindhi (سنڌي) |
| Language Family | Indo-European — Indo-Aryan Branch (Northwestern) |
| ISO 639-1 / 639-3 | sd / snd |
| Glottocode | sind1272 |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Format | TXT |
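
The sample text in this card suggests that each line of the TXT file holds one token followed by a BIO-style tag (for example `B-EVENT`, `I-EVENT`, `O`). A minimal Python sketch for reading such lines and grouping tagged tokens into entity spans might look like the following. The function names are illustrative, and the whitespace-separated layout is an assumption drawn from the sample, not a documented specification. Only the tags visible in the sample are used here; the full tag set is not stated in this card.

```python
from typing import Iterable, List, Tuple


def parse_tagged_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (token, tag) pairs from 'token TAG' lines.

    Assumption: token and tag are separated by whitespace.
    Blank lines and lines without exactly two fields are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        pairs.append((parts[0], parts[1]))
    return pairs


def extract_entities(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities = []
    tokens: List[str] = []
    etype = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity begins; close any open one first.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            # Continuation of the currently open entity.
            tokens.append(token)
        else:
            # 'O' tag, or an I- tag that does not continue the open entity
            # (such stray I- tokens are dropped in this sketch).
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


# A few lines taken from the sample text in this card.
sample = [
    "شروع O",
    "ٿيل O",
    "آپريشن B-EVENT",
    "ضرب I-EVENT",
    "عضب I-EVENT",
    "۽ O",
]
pairs = parse_tagged_lines(sample)
entities = extract_entities(pairs)
print(entities)
```

To process the full corpus, the same functions could be applied to the lines of the downloaded TXT file opened with `encoding="utf-8"`. The file name and any sentence-boundary convention are not given in this card and would need to be checked against the actual download.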
## Sample Text
شروع O
ٿيل O
آپريشن B-EVENT
ضرب I-EVENT
عضب I-EVENT
۽ O
سياسي O
۽ O
عسڪري O
قيادت O
جي O
گڏيل O
صلاح O
سان O
قومي B-EVENT
ايڪشن I-EVENT
پلان I-EVENT
جوڙڻ O
کانپوءِ O
جيتوڻيڪ O
دعوائون O
ته O
اهي O
سامهون O
اينديون O
رهيون O
ته O
دهشتگردن O
جا O
ٿاڪ O
۽ O
ٺڪاڻا O
تباهه O
ڪيا O
ويا O
آهن O
، O
هنن O
جي O
نيٽ O
ورڪ O
کي O
نابود O
ڪيو O
ويو