License: CC-BY-NC-4.0
Steward: Community
Dataset ID: cmqnyh34j03qynr07nhstiajm
Task: NLP
Release Date: 6/21/2026
Format: JSON
Size: 90.65 KB
SiNFLuD is a human-labelled dataset developed to support research on classifying figurative and literal language in Sindhi, a low-resource language. It contains a diverse collection of idioms, metaphors, similes, and proverbs collected from various web resources, representing culturally rich and context-dependent expressions. It is designed to enable computational models to move beyond literal meaning and capture deeper semantic and cultural interpretations of Sindhi text. The dataset is intended for training and evaluating NLP models on tasks such as figurative language classification and semantic analysis. By providing structured annotations of idiomatic and metaphorical expressions, it helps bridge the resource gap in Sindhi language processing and supports the development of more culturally aware language technologies.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is provided for research and educational purposes only. Users must comply with ethical standards, legal requirements, and cultural sensitivity when using the dataset. Figurative expressions must not be misrepresented or used in harmful contexts. Proper attribution is required in all derived works.
Forbidden Usage
This dataset must not be used for any commercial, harmful, or unethical purposes, including generating offensive or discriminatory content or misrepresenting cultural expressions. Any use that violates applicable laws, privacy, or cultural sensitivity is strictly prohibited.
Intended Use
This dataset is intended for research in figurative language classification, semantic analysis, and NLP development for Sindhi. It supports low-resource language modeling and cultural language understanding tasks.
Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. It has a strong literary and cultural tradition but remains a low-resource language in NLP, especially for figurative and semantic understanding tasks.
Perso-Arabic Script (Sindhi) ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
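As a quick sanity check on the script, the sketch below flags characters in a record that fall outside the Arabic script ranges of Unicode. It assumes that Sindhi letters sit in the Unicode Arabic block (U+0600 to U+06FF), with the Arabic Supplement block (U+0750 to U+077F) allowed as a margin; the helper names `in_arabic_block` and `unexpected_chars` are hypothetical and not part of the dataset.

```python
import unicodedata

# Assumed ranges: the Unicode "Arabic" and "Arabic Supplement" blocks.
ARABIC_RANGES = ((0x0600, 0x06FF), (0x0750, 0x077F))


def in_arabic_block(ch: str) -> bool:
    """Return True if the character's code point lies in one of the assumed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def unexpected_chars(text: str) -> list:
    """Return the sorted set of characters that are not Arabic-block characters.

    Whitespace, punctuation (Unicode categories P*) and invisible format
    marks (category Cf) are ignored, since they are expected in running text.
    """
    found = set()
    for ch in text:
        if ch.isspace() or in_arabic_block(ch):
            continue
        category = unicodedata.category(ch)
        if category.startswith("P") or category == "Cf":
            continue
        found.add(ch)
    return sorted(found)


# A sample sentence from this card, then a deliberately mixed-script string.
print(unexpected_chars("ڪتي جا ڏند گڏھ جو ماس."))
print(unexpected_chars("ڪتي abc"))
```

A non-empty result is only a hint that a record may contain transliteration, markup, or encoding debris; it does not by itself mean the record is invalid.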
SiNFLuD-Dataset/
│
└── SiNFLuD
| Field | Details |
|---|---|
| Dataset Name | Sindhi Figurative Language Dataset |
| Language | Sindhi (سنڌي) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-1 / 639-3 | sd / snd |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Domain | Figurative Language / NLP |
| Task Type | Text Classification / Figurative Language Detection |
| Encoding | UTF-8 |
| Format | JSON |
{
  "text": ".اھو ڪي ڪجي جو مينهن وسندي ڪم اچي",
  "label_name": "non-literal",
  "type": "idiom"
},
{
  "text": "سائي کي سهي ڪو نه بکئي کي ڏئي ڪو نه.",
  "label_name": "non_literal",
  "type": "proverb"
},
{
  "text": "ڪتي جا ڏند گڏھ جو ماس.",
  "label_name": "non_literal",
  "type": "metaphor"
}
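The records above can be read with Python's standard `json` module. The following is a minimal sketch, assuming the released file is a JSON array of objects with the three fields shown (`text`, `label_name`, `type`); the actual file layout is not stated on this card, so the samples are embedded inline. Because the samples spell the label both as `non-literal` and `non_literal`, the hypothetical helper `normalize_label` folds both into one form before counting.

```python
import json
from collections import Counter

# The three sample records from this card, wrapped in a list so that they
# parse as a single JSON document (an assumption about the file layout).
SAMPLE = """
[
  {"text": ".اھو ڪي ڪجي جو مينهن وسندي ڪم اچي", "label_name": "non-literal", "type": "idiom"},
  {"text": "سائي کي سهي ڪو نه بکئي کي ڏئي ڪو نه.", "label_name": "non_literal", "type": "proverb"},
  {"text": "ڪتي جا ڏند گڏھ جو ماس.", "label_name": "non_literal", "type": "metaphor"}
]
"""

REQUIRED_FIELDS = {"text", "label_name", "type"}


def normalize_label(label: str) -> str:
    """Fold spelling variants such as 'non-literal' and 'non_literal' together."""
    return label.strip().lower().replace("-", "_")


def load_records(raw: str) -> list:
    """Parse a JSON array of records, check the fields, and normalize labels."""
    records = json.loads(raw)
    for rec in records:
        missing = REQUIRED_FIELDS - set(rec)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        rec["label_name"] = normalize_label(rec["label_name"])
    return records


records = load_records(SAMPLE)
label_counts = Counter(rec["label_name"] for rec in records)
type_counts = Counter(rec["type"] for rec in records)
print(label_counts)
print(type_counts)
```

To read from disk instead, open the file with `encoding="utf-8"` (the encoding listed in the table above) and pass its contents to `load_records`.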