Biomedical Urdu Named Entity Recognition (BioUNER)

Description

The Biomedical Urdu Named Entity Recognition (BioUNER) dataset is a gold-standard annotated corpus for identifying and classifying biomedical entities in Urdu text. Urdu is one of the most widely spoken languages in South Asia, yet it has few publicly available resources for tasks such as named entity recognition. BioUNER addresses this gap by providing manually annotated biomedical text collected from diverse healthcare-related sources, including Urdu news articles, hospital websites, and health blogs. The dataset contains approximately 153,000 annotated tokens and was labeled by three native Urdu annotators familiar with the medical domain, using the Doccano annotation platform. Annotation quality was assessed through inter-annotator agreement, which reached a score of 0.78 and supports the dataset's use as a reliable benchmark resource. The dataset is intended to support the development and evaluation of biomedical NER systems for Urdu and can facilitate downstream applications such as clinical information extraction, medical search, healthcare question answering, and biomedical text mining. By making this resource openly available, BioUNER contributes to advancing Urdu language technologies and enabling more inclusive biomedical AI research for low-resource languages.

Biomedical Urdu Named Entity Recognition (BioUNER) Dataset

Language

Urdu (اردو) is an Indo-Aryan language of the Indo-European language family and serves as the national language of Pakistan. It has more than 200 million speakers worldwide, primarily in Pakistan and India, as well as in diaspora communities across the Middle East, Europe, and North America. Urdu has a rich literary and scholarly tradition and is widely used in education, media, healthcare, and government. Despite this widespread use, Urdu remains underrepresented in biomedical natural language processing, with few annotated resources available for tasks such as named entity recognition and information extraction.

Script

Perso-Arabic Script (Urdu)

ا، ب، پ، ت، ٹ، ث، ج، چ، ح، خ، د، ڈ، ذ، ر، ڑ، ز، ژ، س، ش، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، و، ہ، ھ، ء، ی، ے

Dataset Structure

BioUNER - Biomedical Urdu Named Entity Recognition/
│
└── BioUNER-Dataset

Metadata

| Field | Details |
|---|---|
| Dataset Name | Biomedical Urdu Named Entity Recognition Dataset (BioUNER) |
| Language | Urdu (اردو) |
| Language Family | Indo-European, Indo-Aryan branch |
| ISO 639-1 / 639-3 | `ur` / `urd` |
| Script | Perso-Arabic Script (Urdu, Unicode) |
| Annotation Tool | Doccano |
| Encoding | UTF-8 |


## Sample Text
پوری,O
کرنے,O
کے,O
لئے,O
ھدائیات۔,O
یہ,O
سیکھئے,O
کہ,O
آئرن,O
کی,O
کمی,O
کہتے,O
ھیں,O
اور,O
آپ,O
کو,O
اپنے,O
بچے,O
کی,O
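
The sample above stores one token per line, followed by an ASCII comma and the token's label. As a minimal sketch, assuming the full dataset file follows the same layout and is UTF-8 encoded as stated in the metadata, such lines could be read in Python as shown here. The function name `parse_token_lines` is hypothetical and not part of the dataset, and only the `O` label visible in the sample is used.

```python
from collections import Counter
from typing import List, Tuple


def parse_token_lines(text: str) -> List[Tuple[str, str]]:
    """Parse 'token,label' lines into (token, label) pairs.

    Assumption: one token per line, with the label after the LAST
    ASCII comma, as in the sample shown in this README.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last ASCII comma so a token containing ',' stays whole.
        # The Urdu comma '،' (U+060C) is a different character and is unaffected.
        token, sep, label = line.rpartition(",")
        if not sep:
            raise ValueError(f"Line has no 'token,label' separator: {line!r}")
        pairs.append((token, label))
    return pairs


# Three lines copied from the sample above.
sample = "آئرن,O\nکی,O\nکمی,O\n"
pairs = parse_token_lines(sample)
label_counts = Counter(label for _, label in pairs)
print(pairs)
print(label_counts)
```

To read from disk, the file contents can be passed in with `open(path, encoding="utf-8").read()`, where `path` points to wherever the dataset file was saved locally.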
