License: CC-BY-NC-4.0
Steward: Community
Dataset ID: cmqnyezls0403mm07q1uxjb51
Task: LLM
Release Date: 6/21/2026
Format: JSON
Size: 500.50 KB
SdQuAD is a benchmark dataset developed to support question answering research for the Sindhi language. The dataset contains 15,065 context-question-answer triples written in the Perso-Arabic Sindhi script. Annotation quality was validated through inter-annotator agreement, achieving an average F1 score of 0.838 and an Exact Match score of 0.863. Baseline experiments were conducted using traditional retrieval methods and multilingual transformer models, including mT5, mBERT, and XLM-RoBERTa. SdQuAD provides a resource for developing, evaluating, and benchmarking Sindhi language understanding systems.
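The description reports agreement as F1 and Exact Match but does not say how they were computed. The sketch below shows the common SQuAD-style definitions as one plausible reading. The whitespace tokenization, the punctuation list, and the function names `normalize`, `exact_match`, and `token_f1` are assumptions for illustration, not the dataset authors' evaluation script.

```python
from collections import Counter


def normalize(text: str) -> str:
    # Assumed normalization: drop common Latin and Arabic-script punctuation,
    # then collapse runs of whitespace. The source does not specify this step.
    punctuation = ".,;:!?\"'،؛؟۔"
    cleaned = "".join(ch for ch in text if ch not in punctuation)
    return " ".join(cleaned.split())


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized strings are identical, otherwise 0.0.
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 over whitespace tokens (an assumption for Sindhi text).
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging these per-pair scores over all annotated items would give dataset-level figures comparable in kind to the ones quoted above, assuming the same definitions were used.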
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
Restrictions/Special Constraints
This dataset is released for research and educational purposes in NLP. Users must comply with applicable legal and ethical standards and provide appropriate attribution when using the dataset.
Forbidden Usage
This dataset must not be used for any commercial, harmful, or unethical purposes, including generating offensive or discriminatory content or misrepresenting cultural expressions. Any use that violates applicable laws, privacy, or cultural sensitivity is strictly prohibited.
Intended Use
SdQuAD is intended for developing and evaluating question answering systems for the Sindhi language.
Sindhi (سنڌي) is an Indo-Aryan language of the Indo-European language family, primarily spoken in Pakistan and India. It is one of the oldest literary languages of South Asia and serves as the official language of Sindh province in Pakistan. Despite its large speaker base and rich linguistic heritage, Sindhi remains underrepresented in natural language processing research, particularly in machine reading comprehension and question answering.
Perso-Arabic Script (Sindhi)
ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
SdQuAD-Dataset/
│
└── SdQuAD
| Field | Details |
|---|---|
| Dataset Name | SdQuAD |
| Full Title | SdQuAD: A Benchmark Question Answering Dataset for Low-Resource Sindhi Language |
| Language | Sindhi (سنڌي) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-1 / 639-3 | sd / snd |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Encoding | UTF-8 |
{
"question": "انساني جسم ۾ بي سيلز جو ڪم ڪهڙو آهي؟",
"answer": "اينٽي باڊيز ٺاهڻ."
}
{
"question": "DNA جي ٻيڻي هيلڪس جوڙجڪ ڪهڙي سائنسدان بيان ڪئي؟",
"answer": "واٽسن ۽ ڪرڪ"
}
{
"question": "اي ڊيٽا ريسٽوريشن ڇا آهي؟",
"answer": "اي ڊيٽا ريسٽوريشن بيڪ اپ مان گم ٿيل ڊيٽا کي واپس آڻڻ آهي."
}
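The sample records above each carry a `question` and an `answer` string in UTF-8. A minimal sketch of a sanity check on one such record follows. The helper names `is_arabic_script` and `validate_record` are hypothetical, the layout of the released file is not described in this card, and the check assumes that every field should contain at least one character from the Unicode Arabic block (U+0600 to U+06FF), which covers the Sindhi letters.

```python
import json

REQUIRED_FIELDS = ("question", "answer")


def is_arabic_script(ch: str) -> bool:
    # The Arabic block U+0600..U+06FF includes Sindhi letters such as ٻ, ڄ and ڱ.
    return "\u0600" <= ch <= "\u06FF"


def validate_record(record: dict) -> list:
    # Return a list of human-readable problems; an empty list means the record passed.
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
        elif not any(is_arabic_script(ch) for ch in value):
            problems.append(f"no Arabic-script characters in: {field}")
    return problems


# One record copied from the samples above, parsed from its JSON text.
sample = '{"question": "DNA جي ٻيڻي هيلڪس جوڙجڪ ڪهڙي سائنسدان بيان ڪئي؟", "answer": "واٽسن ۽ ڪرڪ"}'
record = json.loads(sample)
print(validate_record(record))
```

Mixed-script questions such as the DNA example still pass, because the check only requires that some Arabic-script character is present in each field.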