Sindhi to English Parallel Corpus

Sindhi–English Parallel Corpus

Language

Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. It remains a low-resource language in NLP, especially for machine translation and cross-lingual tasks.

Script

Perso-Arabic Script (Sindhi) ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه

Dataset Structure

Sindhi-English Dataset/
│
├── train.jsonl
├── test.jsonl
└── val.jsonl
└── README.md

Metadata

Field	Details
Dataset Name	Sindhi-English Parallel Corpus
Language	Sindhi (سنڌي) - English (en)
Language Family	Indo-European — Indo-Aryan Branch
ISO / 639-3	`snd`/ `eng`
Script	Perso-Arabic Script (Sindhi, Unicode)
Format	JSONL (train, test, validation split)
Encoding	UTF-8

Domain Distribution

Category	Count	Percentage
Literature	24,910	68.01%
Daily Life	5,657	15.45%
Family	1,988	5.43%
Work	685	1.87%
Health	609	1.66%
Education	578	1.58%
Phone Conversation	575	1.57%
Travel	570	1.56%
Greetings	528	1.44%
Shopping	526	1.44%

The dataset is predominantly composed of Literature recordings (68.01%), followed by Daily Life (15.45%) and Family (5.43%). The remaining domains (Work, Health, Education, Phone Conversation, Travel, Greetings, and Shopping) each contribute between 1.4% and 1.9% of the dataset.

Sample Text

{"sd":"ٿوري دير بعد کيکڙو پنهنجي ٻر مان ٻاهر نڪتو، پنهنجي جهوني دوست ٻگهه پکيءَ کي ان حال ۾ ڏسي حيران ٿي ويو، خير ته آهي ڀاءُ ٻگهه پکي؟","en":"After a while, the crab came out of his burrow and was surprised to see his old friend, the crane, in that condition. 'Is everything alright, brother crane?'","category":"literature","sd_word_count":28,"en_word_count":28,"sd_char_count":128,"en_char_count":157}
{"sd":"جنهن جي ماني تنهن جي ڪائٺي","en":"He who provides the food, commands the respect.","category":"literature","sd_word_count":6,"en_word_count":8,"sd_char_count":26,"en_char_count":47}
{"sd":"ٻارڙن کي ڪهاڙين سان ڳڇا ڪيو هوندائين.","en":"He must have hacked the children with axes.","category":"literature","sd_word_count":7,"en_word_count":8,"sd_char_count":37,"en_char_count":43}

Description

Specifics

Considerations

Processes

Metadata

Sindhi–English Parallel Corpus

Language

Script

Dataset Structure

Metadata

Domain Distribution

Sample Text