Sindhi Corpus

Description

SdCorpus is a monolingual text corpus for the Sindhi language, comprising 6,246,432 sentences and 142,175,661 tokens. The text was crawled from publicly available web resources, including news portals and literary websites, to give broad coverage of modern written Sindhi. SdCorpus is designed to support research and development in natural language processing (NLP), particularly for low-resource languages. It is suitable for pretraining transformer-based language models such as BERT, RoBERTa, and DistilBERT, and for continued pretraining of multilingual foundation models such as SmolLM and Meta's Llama models. The corpus can also be used for language modeling, instruction fine-tuning, text generation, and the development of conversational AI and other Sindhi language applications.

Specifics

Sindhi Corpus (SdCorpus)

Language

Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. Despite its rich literary heritage, it remains a low-resource language for Natural Language Processing (NLP). SdCorpus provides a large-scale monolingual textual resource to support language modeling, representation learning, and other NLP applications.

Script

Perso-Arabic Script (Sindhi)

ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
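
The letters listed above are normally encoded in the Unicode Arabic block (U+0600 to U+06FF). As a minimal sketch, assuming that a simple script check is useful when inspecting crawled lines, the hypothetical helper below (not part of SdCorpus) reports the share of non-whitespace characters in a string that fall inside that block. It is not a language identifier: Urdu, Arabic, and Persian text would score just as high.

```python
def arabic_block_ratio(text):
    """Return the fraction of non-whitespace characters in U+0600..U+06FF.

    Hypothetical helper for illustration; the corpus authors' own
    filtering procedure is not described in the dataset card.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0  # empty or whitespace-only input
    in_block = sum(1 for ch in chars if "\u0600" <= ch <= "\u06FF")
    return in_block / len(chars)
```

A line made up only of Sindhi letters scores 1.0, while a line of Latin text scores 0.0. Any threshold for keeping or discarding a line would be a choice for the user to make.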

Dataset Structure

Sindhi Corpus/
│
└── Sindhi corpus.txt
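
Because the corpus ships as a single UTF-8 text file, it can be streamed rather than loaded into memory at once. The sketch below rests on two assumptions the dataset card does not confirm: that each non-empty line holds one sentence, and that tokens are separated by whitespace. The function name is hypothetical.

```python
def count_sentences_and_tokens(lines):
    """Count non-empty lines and whitespace-separated tokens.

    `lines` is any iterable of strings, such as an open file handle.
    Assumes one sentence per line and whitespace tokenization; the
    counts reported for SdCorpus may have been produced differently.
    """
    n_sentences = 0
    n_tokens = 0
    for line in lines:
        tokens = line.split()  # split() with no argument ignores the newline
        if not tokens:
            continue  # skip blank lines
        n_sentences += 1
        n_tokens += len(tokens)
    return n_sentences, n_tokens
```

To run it over the file shown in the tree above, open the file with `open("Sindhi Corpus/Sindhi corpus.txt", encoding="utf-8")` and pass the handle to the function. Reading line by line keeps memory use flat regardless of file size.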

Metadata

Field	Details
Dataset Name	Sindhi Corpus (SdCorpus)
Language	Sindhi (سنڌي)
ISO 639-3	`snd`
Script	Perso-Arabic Script (Sindhi, Unicode)
Domain	General-purpose Monolingual Text Corpus
Task Type	Language Modeling / NLP Pretraining
Encoding	UTF-8
Format	TXT
Total Sentences	6,246,432
Total Tokens	142,175,661
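
The two totals in the table give a rough average sentence length. The short calculation below uses only the figures reported above; the function name is illustrative.

```python
TOTAL_SENTENCES = 6_246_432   # from the metadata table
TOTAL_TOKENS = 142_175_661    # from the metadata table


def average_tokens_per_sentence(tokens, sentences):
    """Mean number of tokens per sentence."""
    return tokens / sentences


avg = average_tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES)
print(f"{avg:.2f}")  # prints 22.76
```

An average of roughly 23 tokens per sentence can inform choices such as the maximum sequence length when packing sentences into training examples.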
