Sindh Line Publishers | Mozilla Data Collective

Description

The corpus contains about 1.029 million tokens from Sindh Line, a Sindhi-language daily newspaper published in Karachi, Pakistan, covering issues from 2024 to 2025. The text comprises the complete newspaper content, including headlines, editorials, finance news, and advertisements.

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

https://spdx.org/licenses/CC-BY-SA-4.0.html

Metadata

Overview

Dataset name: Sindh Line Publisher Sindhi Newspaper Corpus (2024–2025)
Language: Sindhi (may include some Urdu/English in finance, names, ads)
Location / publisher: Karachi, Pakistan (daily publication)
Time coverage: 2024–2025
Size: ~1.029 million tokens
Content included: complete newspaper text — headlines, editorials, finance news, advertisements

Language

Sindhi (سِنڌِي‎, Sindhī, [sɪndʱi]) is an Indo-Aryan language spoken by the Sindhi people in the province of Sindh, Pakistan. It is the official language of the province and the mother tongue of over 34 million people in Pakistan and 1.7 million people in India.

Script

ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي
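
As a rough aid for the "non-Sindhi characters" cleaning step, the sketch below counts alphabetic characters that fall outside the letter list above. The names `SINDHI_ALPHABET` and `chars_outside_alphabet` are illustrative, not part of the dataset. Real Sindhi text may legitimately use code points beyond these base letters (for example alef with madda or hamza carriers), so the output is a list of candidates for review, not characters to delete automatically.

```python
from collections import Counter

# Base letters copied from the Script list above; the digraphs (جھ, گھ)
# simply contribute their component code points to the set.
SINDHI_ALPHABET = (
    "ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش "
    "ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي"
)
SINDHI_LETTERS = set(SINDHI_ALPHABET.replace(" ", ""))


def chars_outside_alphabet(text: str) -> Counter:
    """Count alphabetic characters that are not in the listed alphabet.

    Digits, punctuation, whitespace, and combining marks are ignored
    because str.isalpha() is False for them.
    """
    return Counter(
        ch for ch in text if ch.isalpha() and ch not in SINDHI_LETTERS
    )
```

Calling `chars_outside_alphabet(text).most_common(20)` on a slice of the corpus gives a quick view of which extra letters (Latin, Urdu-specific, or variant forms) occur most often.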

Sample

هن جڏهن ميثاق معيشت جي ڳالهه ڪئي ته کيس توهين سان رد ڪيو ويو، اڄ به ميثاق معيشت لاءِ تيار آهن، 9 مهينن ۾ اسان وڏين چئلينجن کي منهن ڏنو

روئڻ جو ڪو به فائدو ناهي،پاليسي ريٽ ۾ وڌيڪ گهٽتائي ڪئي وڃي، مان چاهيان ٿو ته ٽيڪسن کي گهٽايو وڃي ته جيئن ٽيڪس چوري نه ٿئي: خطاب

اڏار پاڪستان جو محور برآمداتي ترقي آهي، معاشي استحڪام اچي چڪو آهي، هاڻي اسان کي ترقي ڏانهن وڌڻو آهي، برآمدات وڌائڻ لاءِ ڪاروبار دوست ماحول پيدا ڪرڻو پوندو

Why this dataset

A modern, real-world Sindhi news corpus for Sindhi NLP, linguistic research, and digital preservation, covering multiple registers (formal editorials → mixed-style ads).

Data Composition

What’s included: headlines, editorials, finance/business items, advertisements (complete textual content)
Granularity: one combined corpus file (all issues/content concatenated in a single .txt file)
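
Because everything is in one file, a quick size check can be streamed line by line. This is a minimal sketch that assumes whitespace-delimited tokens; the card does not state how the ~1.029 million figure was computed, so the count may differ. The file name in the usage comment is hypothetical.

```python
from typing import Iterable


def count_whitespace_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens across an iterable of lines."""
    return sum(len(line.split()) for line in lines)


# Usage (hypothetical file name):
# with open("sindh_line_corpus.txt", encoding="utf-8") as fh:
#     print(count_whitespace_tokens(fh))
```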

Processing (recommended)

Single combined TXT file: Keep the original file as Raw (unchanged) and create a second Clean version derived from it.
- Raw: the full newspaper text as collected (one combined .txt file)
- Clean: preprocessing on the combined file, including:
  - remove or normalize alphanumeric strings, extra symbols, and non-Sindhi characters (as needed)
  - Unicode normalization + whitespace/punctuation cleanup
  - optional removal of repeated boilerplate (if present)
  - sentence segmentation / parsing to create training-ready units
Recommended: redact or mask PII that may appear in advertisements/classifieds (phone numbers, emails, addresses) before release or model training.

Note: Because the file holds the entire newspaper content as one continuous text, it will likely need the cleaning and sentence segmentation described above before it can be used reliably for training.
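
The Clean steps above could be sketched roughly as follows. This is a minimal illustration, not the publisher's pipeline: the choice of NFC normalization, the set of sentence terminators (`.`, `!`, `?`, the Arabic question mark U+061F, and the Arabic full stop U+06D4), and the function names `clean_line` and `iter_sentences` are all assumptions. Input is processed per line so that headlines without final punctuation are not merged into the following line.

```python
import re
import unicodedata
from typing import Iterator

_LATIN_ALNUM = re.compile(r"[A-Za-z0-9]+")          # Latin letters and Western digits
_ARABIC_COMMA = re.compile(r"\s*\u060C\s*")          # Arabic comma with any spacing
_WS = re.compile(r"\s+")
_SENT_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")  # assumed sentence terminators


def clean_line(line: str, drop_latin: bool = False) -> str:
    """Normalize one line: NFC, optional Latin/digit removal, spacing cleanup."""
    text = unicodedata.normalize("NFC", line).replace("\ufeff", "")
    if drop_latin:
        text = _LATIN_ALNUM.sub(" ", text)
    text = _ARABIC_COMMA.sub("\u060C ", text)  # exactly one space after the comma
    return _WS.sub(" ", text).strip()


def iter_sentences(raw_text: str, drop_latin: bool = False) -> Iterator[str]:
    """Yield cleaned sentence-like units from the combined raw text."""
    for line in raw_text.splitlines():
        cleaned = clean_line(line, drop_latin=drop_latin)
        if not cleaned:
            continue
        for sentence in _SENT_END.split(cleaned):
            sentence = sentence.strip()
            if sentence:
                yield sentence
```

`drop_latin` defaults to `False` because digits and Latin tokens can carry meaning in finance items; whether to remove or keep them depends on the downstream task.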

Ethics & privacy

Ads/classifieds may contain personal details. Avoid releasing unredacted PII, and do not use the data for doxxing or targeting individuals.
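
A first pass at masking could use regular expressions, as in the sketch below. The patterns and the placeholder tokens `<EMAIL>` and `<PHONE>` are assumptions chosen for illustration. They only catch email addresses and phone-like digit runs of ten or more characters; postal addresses and names need separate handling, and long non-phone numbers may be masked by mistake.

```python
import re

# Simple email shape: local part, "@", domain with at least one dot.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Phone-like run: optional "+", then digits separated by spaces or hyphens,
# at least 10 characters long. In Python 3, \d also matches Arabic-Indic digits.
PHONE = re.compile(r"(?<!\d)\+?\d[\d\s-]{8,}\d(?!\d)")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)
```

Short numeric ranges such as `2024-2025` are left untouched because they fall below the minimum length of the phone pattern.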

Limitations

The corpus reflects a written-news register (not speech) and Karachi-centric coverage. Advertisements can skew the vocabulary, and OCR may introduce systematic errors.