License:
CC-BY-NC-4.0
Steward:
Community
Dataset ID:
cmr54pbgw01omnt07h9vzqrlb
Task: LM
Release Date: 7/3/2026
Format: TXT
Size: 321.00 MB
SdCorpus is a comprehensive monolingual text corpus for the Sindhi language, comprising 6,246,432 sentences and 142,175,661 tokens. The corpus was crawled from diverse publicly available web resources, including news portals and literary websites, to provide broad linguistic coverage of modern written Sindhi. SdCorpus is designed to support research and development in NLP, particularly for low-resource languages. It is suitable for pretraining transformer-based language models such as BERT, RoBERTa, and DistilBERT, as well as continued pretraining of multilingual foundation models, including SmolLM and Meta's Llama models. Additionally, the corpus can be used for language modeling, instruction fine-tuning, text generation, and the development of conversational AI and other Sindhi language applications.
Licensing
Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC-4.0)
Restrictions/Special Constraints
This dataset is intended for research, educational, and lawful NLP applications. Users must comply with the dataset license and all applicable laws and regulations.
Forbidden Usage
This dataset must not be used for unlawful, harmful, or malicious activities, nor may it be redistributed in violation of its license or used to generate misleading or deceptive content.
Ethical Review
This dataset is intended for research and educational use in Sindhi NLP, including language modeling, pretraining, and other monolingual language processing tasks. Any commercial use requires prior permission from the dataset creators.
Intended Use
SdCorpus is intended for research and development in NLP for the Sindhi language. It can be used for pretraining and continued pretraining of language models, language modeling, text generation, representation learning, and other downstream NLP tasks involving monolingual Sindhi text.
Sindhi (سنڌي) is an Indo-Aryan language spoken primarily in Pakistan and India. Despite its rich literary heritage, it remains a low-resource language for Natural Language Processing (NLP). SdCorpus provides a large-scale monolingual textual resource to support language modeling, representation learning, and other NLP applications.
Perso-Arabic Script (Sindhi)
ا، ب، ٻ، ڀ، پ، ت، ٿ، ٽ، ٺ، ث، ج، ڄ، جھ، ڃ، چ، ڇ، ح، خ، د، ڌ، ڏ، ڊ، ڍ، ذ، ر، ڙ، ز، س، ش، ص، ض، ط، ظ، ع، غ، ف، ڦ، ق، ڪ، ک، گ، ڳ، ڱ، ل، م، ن، ڻ، و، ھ، ء، ي، ه
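The letters listed above all fall inside the Unicode Arabic block (U+0600 to U+06FF), which suggests a simple script check when screening text. The following is a minimal sketch only: the function name `arabic_script_ratio` is hypothetical, and the check cannot tell Sindhi apart from other languages written in Perso-Arabic script, such as Urdu or Arabic.

```python
def arabic_script_ratio(text):
    """Return the share of alphabetic characters in the Unicode Arabic block.

    Illustrative helper, not part of SdCorpus. Digits, punctuation,
    whitespace, and combining marks are ignored because they are not
    alphabetic according to str.isalpha().
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return in_block / len(letters)


if __name__ == "__main__":
    print(arabic_script_ratio("سنڌي"))    # every letter is in the Arabic block
    print(arabic_script_ratio("Sindhi"))  # Latin letters only
```

A threshold on this ratio could be used to flag lines dominated by another script; any specific threshold would need to be chosen empirically.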
Sindhi Corpus/
│
└── Sindhi corpus.txt
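Because the corpus ships as a single large UTF-8 text file, it is usually more practical to stream it line by line than to load it into memory. The sketch below makes two assumptions the dataset card does not confirm: that the file holds one sentence per line, and that tokens are separated by whitespace. The function name `corpus_stats` is hypothetical, and the counts it produces may differ from the published totals if the authors used a different segmentation or tokenizer.

```python
from pathlib import Path


def corpus_stats(path):
    """Stream a UTF-8 text file, counting non-empty lines and whitespace tokens.

    Assumes one sentence per line and whitespace tokenization; both are
    illustrative assumptions, not documented properties of SdCorpus.
    """
    sentences = 0
    tokens = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            sentences += 1
            tokens += len(line.split())
    return sentences, tokens


if __name__ == "__main__":
    # Path taken from the directory layout shown above.
    corpus_path = Path("Sindhi Corpus") / "Sindhi corpus.txt"
    if corpus_path.exists():
        print(corpus_stats(corpus_path))
```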
| Field | Details |
|---|---|
| Dataset Name | Sindhi Corpus (SdCorpus) |
| Language | Sindhi (سنڌي) |
| ISO 639-3 | snd |
| Script | Perso-Arabic Script (Sindhi, Unicode) |
| Domain | General-purpose Monolingual Text Corpus |
| Task Type | Language Modeling / NLP Pretraining |
| Encoding | UTF-8 |
| Format | TXT |
| Total Sentences | 6,246,432 |
| Total Tokens | 142,175,661 |
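
As a quick sanity check on the totals in the table, the mean sentence length follows directly from dividing tokens by sentences. The variable names are illustrative; the two figures are the ones reported above.

```python
# Totals as reported in the dataset card.
total_sentences = 6_246_432
total_tokens = 142_175_661

# Mean number of tokens per sentence.
mean_tokens_per_sentence = total_tokens / total_sentences
print(f"{mean_tokens_per_sentence:.2f}")  # prints 22.76
```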