License:
CC-BY-SA-4.0
Steward:
Sindh Line Publishers
Task: NLP
Release Date: 1/5/2026
Format: TXT
Size: 2.22 MB
The corpus contains approximately 1.029 million tokens from Sindh Line, a Sindhi-language newspaper, covering issues published in 2024-2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Dataset name: Sindh Line Publisher Sindhi Newspaper Corpus (2024–2025)
Language: Sindhi (may include some Urdu/English in finance, names, ads)
Location / publisher: Karachi, Pakistan (daily publication)
Time coverage: 2024–2025
Size: ~1.029 million tokens
Content included: complete newspaper text — headlines, editorials, finance news, advertisements
Sindhi (سِنڌِي, Sindhī, [sɪndʱi]) is an Indo-Aryan language spoken by the Sindhi people in the province of Sindh, Pakistan. It is the official language of the province and the mother tongue of over 34 million people in Pakistan and 1.7 million people in India.
ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي
هن جڏهن ميثاق معيشت جي ڳالهه ڪئي ته کيس توهين سان رد ڪيو ويو، اڄ به ميثاق معيشت لاءِ تيار آهن، 9 مهينن ۾ اسان وڏين چئلينجن کي منهن ڏنو
روئڻ جو ڪو به فائدو ناهي،پاليسي ريٽ ۾ وڌيڪ گهٽتائي ڪئي وڃي، مان چاهيان ٿو ته ٽيڪسن کي گهٽايو وڃي ته جيئن ٽيڪس چوري نه ٿئي: خطاب
اڏار پاڪستان جو محور برآمداتي ترقي آهي، معاشي استحڪام اچي چڪو آهي، هاڻي اسان کي ترقي ڏانهن وڌڻو آهي، برآمدات وڌائڻ لاءِ ڪاروبار دوست ماحول پيدا ڪرڻو پوندو
A modern, real-world Sindhi news corpus for Sindhi NLP, linguistic research, and digital preservation, covering multiple registers (formal editorials → mixed-style ads).
What’s included: headlines, editorials, finance/business items, advertisements (complete textual content)
Granularity: one combined corpus file (all issues/content concatenated in a single .txt file)
Single combined TXT file: Keep the original file as Raw (unchanged) and create a second Clean version derived from it.
Raw: the full newspaper text as collected (one combined .txt file)
Clean: preprocessing on the combined file, including:
remove or normalize alphanumeric strings, extra symbols, and non-Sindhi characters (as needed)
Unicode normalization + whitespace/punctuation cleanup
optional removal of repeated boilerplate (if present)
sentence segmentation / parsing to create training-ready units
Optional (recommended): redact or mask PII that may appear in advertisements/classifieds (phone numbers, emails, addresses) before release or model training.
Note: Since the file contains the entire newspaper content in one text, it may require the above cleaning and sentence parsing to be used reliably for training purposes.
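The cleaning and sentence-parsing steps listed above could be sketched as follows. This is a minimal illustration, not the steward's actual pipeline: the function names, the choice of NFC normalization, the rule for dropping Latin alphanumeric strings, and the set of sentence-final marks are all assumptions that should be checked against the real file.

```python
import re
import unicodedata

# Assumption: sentences end with the Arabic full stop (U+06D4), the Arabic
# question mark (U+061F), or ASCII '.', '!' or '?', followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F.!?])\s+")

# Hypothetical rule: a Latin letter followed by Latin letters or ASCII digits,
# e.g. "ABC123". Bare digits such as "9" are left in place.
LATIN_ALNUM = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def clean_text(raw: str, drop_latin: bool = True) -> str:
    """Normalize Unicode, optionally drop Latin tokens, collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    if drop_latin:
        text = LATIN_ALNUM.sub(" ", text)
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def split_sentences(text: str) -> list[str]:
    """Split cleaned text after sentence-final punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
```

Many headlines in the corpus end without a full stop, so splitting on the original line breaks before collapsing whitespace may give better units than punctuation alone. Set drop_latin=False to keep English names and finance terms.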
Ads and classifieds may contain personal details. Avoid releasing unredacted PII, and do not use the corpus for doxxing or targeting individuals.
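A minimal sketch of PII masking for phone numbers and email addresses is shown below. The patterns, the placeholder strings [EMAIL] and [PHONE], and the nine-digit minimum for phone numbers are assumptions chosen for illustration. Street addresses are not handled and would need manual review or a separate approach.

```python
import re

# Simplified email pattern (assumption: ASCII-only addresses).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumption: a phone number has at least 9 digits, optionally starting with
# '+', with single spaces or hyphens allowed between digits. In Python, \d
# also matches Arabic-Indic digits, which may appear in Sindhi text.
PHONE = re.compile(r"\+?\d(?:[ -]?\d){8,}")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    # Mask emails first so digits inside an address are not treated as a phone.
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

For example, mask_pii("Call 0300-1234567 or a@b.com") returns "Call [PHONE] or [EMAIL]", while a year range such as "2024-2025" is left unchanged because it has only eight digits. Long numeric values in finance news (account numbers, large amounts written without separators) could still be masked by mistake, so a sample of the output should be inspected before release.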
Limitations: written-news register (not speech); Karachi-centric coverage; advertisements can skew vocabulary; OCR may introduce systematic errors.