License: CC-BY-NC-4.0
Steward: Community
Dataset ID: cmr29jici02trmk07l5xlcsak
Task: MT
Release Date: 7/1/2026
Format: JSONL
Size: 2.50 MB
A Sindhi–English parallel corpus of nearly 25,000 sentence pairs, developed for training robust machine translation systems. The corpus covers a wide range of domains, including literature, daily-life conversations, common greetings, shopping, travel, family, health, education, and telephone conversations. The dataset is provided in JSONL format and is suitable for fine-tuning transformer-based language models for machine translation and other cross-lingual natural language processing tasks.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is for research and educational use only and is intended for machine translation tasks. It may contain informal text, domain imbalance, and variable quality, so preprocessing and filtering are recommended. Commercial use requires prior permission from the dataset creators.
Forbidden Usage
The dataset must not be used for commercial purposes, deployment in production systems without evaluation, or any unlawful, harmful, or unethical applications.
Ethical Review
This dataset is intended for research and educational use in Sindhi–English machine translation and cross-lingual NLP. Any commercial use requires prior permission from the dataset creators.
Intended Use
This dataset is intended for use in developing and evaluating neural machine translation systems for Sindhi–English language pairs, as well as supporting research in cross-lingual natural language processing.
Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. It remains a low-resource language in NLP, especially for machine translation and cross-lingual tasks.
Perso-Arabic Script (Sindhi) ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
Sindhi-English Dataset/
│
├── train.jsonl
├── test.jsonl
├── val.jsonl
└── README.md
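The layout above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper names `load_jsonl` and `load_splits` are illustrative, and the root directory name is assumed to match the tree shown above.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


def load_splits(root):
    """Load train/val/test from `root`; file names follow the tree above."""
    root = Path(root)
    return {
        name: load_jsonl(root / f"{name}.jsonl")
        for name in ("train", "val", "test")
    }


# Example call (the directory name is an assumption; adjust to the local path):
# splits = load_splits("Sindhi-English Dataset")
# print({name: len(rows) for name, rows in splits.items()})
```

Opening the files with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters for the Perso-Arabic text in the `sd` field.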
| Field | Details |
|---|---|
| Dataset Name | Sindhi-English Parallel Corpus |
| Language | Sindhi (سنڌي) - English (en) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-3 | snd / eng |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Format | JSONL (train, test, validation split) |
| Encoding | UTF-8 |
| Category | Count | Percentage |
|---|---|---|
| Literature | 24,910 | 68.01% |
| Daily Life | 5,657 | 15.45% |
| Family | 1,988 | 5.43% |
| Work | 685 | 1.87% |
| Health | 609 | 1.66% |
| Education | 578 | 1.58% |
| Phone Conversation | 575 | 1.57% |
| Travel | 570 | 1.56% |
| Greetings | 528 | 1.44% |
| Shopping | 526 | 1.44% |
The dataset is dominated by Literature text (68.01%), followed by Daily Life (15.45%) and Family (5.43%). The remaining domains (Work, Health, Education, Phone Conversation, Travel, Greetings, and Shopping) each contribute between 1.4% and 1.9% of the dataset.
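The percentages in the table can be reproduced from the counts. The short sketch below assumes the counts are rows per category and that each percentage is the count divided by the sum of all counts, rounded to two decimals; under that reading the counts sum to 36,626.

```python
# Counts copied from the category table above.
counts = {
    "Literature": 24910,
    "Daily Life": 5657,
    "Family": 1988,
    "Work": 685,
    "Health": 609,
    "Education": 578,
    "Phone Conversation": 575,
    "Travel": 570,
    "Greetings": 528,
    "Shopping": 526,
}

total = sum(counts.values())

# Share of each category as a percentage of the total, two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<20} {counts[name]:>6} {share:>6.2f}%")
print(f"{'Total':<20} {total:>6}")
```

Because Literature accounts for roughly two thirds of the rows, shares like these are a reasonable starting point for deciding whether to down-sample that category or up-sample the smaller ones during training.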
{"sd":"ٿوري دير بعد کيکڙو پنهنجي ٻر مان ٻاهر نڪتو، پنهنجي جهوني دوست ٻگهه پکيءَ کي ان حال ۾ ڏسي حيران ٿي ويو، خير ته آهي ڀاءُ ٻگهه پکي؟","en":"After a while, the crab came out of his burrow and was surprised to see his old friend, the crane, in that condition. 'Is everything alright, brother crane?'","category":"literature","sd_word_count":28,"en_word_count":28,"sd_char_count":128,"en_char_count":157}
{"sd":"جنهن جي ماني تنهن جي ڪائٺي","en":"He who provides the food, commands the respect.","category":"literature","sd_word_count":6,"en_word_count":8,"sd_char_count":26,"en_char_count":47}
{"sd":"ٻارڙن کي ڪهاڙين سان ڳڇا ڪيو هوندائين.","en":"He must have hacked the children with axes.","category":"literature","sd_word_count":7,"en_word_count":8,"sd_char_count":37,"en_char_count":43}
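Each record carries seven fields: `sd`, `en`, `category`, and word and character counts for both sides. Note that `category` is stored in lowercase in the records. The sketch below checks a record against that schema. It assumes, based on the samples above rather than on any documented rule, that word counts are whitespace-separated tokens and character counts are string lengths including spaces; `check_record` is an illustrative helper name.

```python
import json

REQUIRED_FIELDS = {
    "sd", "en", "category",
    "sd_word_count", "en_word_count",
    "sd_char_count", "en_char_count",
}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    for lang in ("sd", "en"):
        text = record[lang]
        # Assumed convention: words are whitespace-separated tokens.
        if len(text.split()) != record[f"{lang}_word_count"]:
            problems.append(f"{lang}_word_count mismatch")
        # Assumed convention: characters are counted with len(), spaces included.
        if len(text) != record[f"{lang}_char_count"]:
            problems.append(f"{lang}_char_count mismatch")
    return problems


# Second sample record from above, parsed from its JSONL line.
line = (
    '{"sd":"جنهن جي ماني تنهن جي ڪائٺي",'
    '"en":"He who provides the food, commands the respect.",'
    '"category":"literature","sd_word_count":6,"en_word_count":8,'
    '"sd_char_count":26,"en_char_count":47}'
)
print(check_record(json.loads(line)))
```

A check like this is a cheap first filtering pass: records with missing fields or inconsistent counts can be logged and set aside before training.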