Sindhi Question-Answering Dataset (SdQuAD)

Description

SdQuAD is a benchmark dataset for question answering research in the Sindhi language. It contains 15,065 context-question-answer triples written in the Perso-Arabic Sindhi script. Annotation quality was validated through inter-annotator agreement, with an average F1 score of 0.838 and an Exact Match score of 0.863. Baseline experiments used traditional retrieval methods and multilingual transformer models, including mT5, mBERT, and XLM-RoBERTa. SdQuAD is intended as a resource for developing, evaluating, and benchmarking Sindhi language understanding systems.
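The description does not state how the F1 and Exact Match agreement scores were computed. The sketch below assumes SQuAD-style metrics over whitespace tokens, with surrounding punctuation stripped. The function names and the punctuation set are illustrative choices, not part of the dataset.

```python
from collections import Counter

# Assumed punctuation set (Latin and Arabic-script marks); not specified by the dataset.
PUNCTUATION = ".,?!،؟؛"

def normalize(text: str) -> list[str]:
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as agreement; one empty counts as disagreement.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For inter-annotator agreement, one annotator's answer would be passed as `prediction` and the other's as `reference`, and the scores averaged over all items.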

Specifics

Licensing

Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)

SdQuAD: A Benchmark Question Answering Dataset for Low-Resource Sindhi Language

Language

Sindhi (سنڌي) is an Indo-Aryan language of the Indo-European language family, primarily spoken in Pakistan and India. It is one of the oldest literary languages of South Asia and serves as the official language of Sindh province in Pakistan. Despite its large speaker base and rich linguistic heritage, Sindhi remains underrepresented in natural language processing research, particularly in machine reading comprehension and question answering.

Script

Perso-Arabic Script (Sindhi) ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه

Dataset Structure

SdQuAD-Dataset/
│
└── SdQuAD

Metadata

Field	Details
Dataset Name	SdQuAD
Full Title	SdQuAD: A Benchmark Question Answering Dataset for Low-Resource Sindhi Language
Language	Sindhi (سنڌي)
Language Family	Indo-European — Indo-Aryan Branch
ISO 639-1 / 639-3	`sd` / `snd`
Script	Perso-Arabic Script (Sindhi, Unicode)
Encoding	UTF-8
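The Script and Encoding rows suggest a simple sanity check when loading the data. The sketch below decodes bytes as UTF-8 and measures how many letters fall in the Unicode Arabic block (U+0600 to U+06FF), which contains the Sindhi-specific letters. The function name `arabic_letter_ratio` is hypothetical, and any acceptance threshold is left to the user.

```python
def arabic_letter_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in the Unicode Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return in_block / len(letters)

# Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
raw = "سنڌي".encode("utf-8")
text = raw.decode("utf-8")
ratio = arabic_letter_ratio(text)
```

Questions that mix in Latin terms such as "DNA" will score below 1.0, so the ratio is better used to flag outliers than as a strict filter.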

Sample Text

{
  "question": "انساني جسم ۾ بي سيلز جو ڪم ڪهڙو آهي؟",
  "answer": "اينٽي باڊيز ٺاهڻ."
}
{
  "question": "DNA جي ٻيڻي هيلڪس جوڙجڪ ڪهڙي سائنسدان بيان ڪئي؟",
  "answer": "واٽسن ۽ ڪرڪ"
}
{
  "question": "اي ڊيٽا ريسٽوريشن ڇا آهي؟",
  "answer": "اي ڊيٽا ريسٽوريشن بيڪ اپ مان گم ٿيل ڊيٽا کي واپس آڻڻ آهي."
}
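The on-disk file format is not documented here. Assuming records are stored as back-to-back JSON objects, as in the sample above, a minimal sketch for reading them with Python's standard `json` module might look like this. The helper name `iter_records` is hypothetical, and the field names `question` and `answer` are taken from the sample.

```python
import json
from typing import Iterator

def iter_records(text: str) -> Iterator[dict]:
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so advance past it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        record, pos = decoder.raw_decode(text, pos)
        yield record

sample = """
{
  "question": "q1",
  "answer": "a1"
}
{
  "question": "q2",
  "answer": "a2"
}
"""
records = list(iter_records(sample))
```

If the released file turns out to be a single JSON array instead, `json.load` on a file opened with `encoding="utf-8"` would be the simpler route.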
