License: CC-BY-NC-4.0
Steward: Community
Dataset ID: cmqnyjp9j0407mm07nolj3h4h
Task: NLP
Release Date: 6/21/2026
Format: CSV
Size: 336.69 KB
The Biomedical Urdu Named Entity Recognition (BioUNER) dataset is a gold-standard annotated corpus for identifying and classifying biomedical entities in Urdu text. Urdu is one of the most widely spoken languages in South Asia, yet few public resources exist for tasks such as named entity recognition. BioUNER addresses this gap with manually annotated biomedical text collected from diverse healthcare-related sources, including Urdu news articles, hospital websites, and health blogs. The dataset contains approximately 153,000 annotated tokens, labeled on the Doccano annotation platform by three native Urdu speakers familiar with the medical domain. Annotation quality was assessed through inter-annotator agreement, which reached a score of 0.78 and supports the dataset's use as a reliable benchmark resource. The dataset is intended to support the development and evaluation of biomedical NER systems for Urdu and can facilitate downstream applications such as clinical information extraction, medical search, healthcare question answering, and biomedical text mining. By making this resource openly available, BioUNER contributes to advancing Urdu language technologies and enabling more inclusive biomedical AI research for low-resource languages.
Licensing
Creative Commons Attribution Non Commercial 4.0 International (CC-BY-NC-4.0)
https://spdx.org/licenses/CC-BY-NC-4.0.html
Restrictions/Special Constraints
This dataset is provided for research and educational use in biomedical NLP. Users must comply with applicable legal and ethical standards and provide appropriate attribution when using the dataset. Commercial use requires obtaining permission first.
Forbidden Usage
Any use of this dataset must respect ethical AI principles and applicable data protection requirements.
Intended Use
The dataset is intended to support the development and evaluation of biomedical NER systems for Urdu and can facilitate downstream applications such as clinical information extraction, medical search, healthcare question answering, and biomedical text mining.
Urdu (اردو) is an Indo-Aryan language of the Indo-European language family and serves as the national language of Pakistan. It has more than 200 million speakers worldwide, primarily in Pakistan and India, as well as in diaspora communities across the Middle East, Europe, and North America. Urdu has a rich literary and scholarly tradition and is widely used in education, media, healthcare, and government. Despite its widespread use, Urdu remains underrepresented in biomedical natural language processing, with limited annotated resources available for tasks such as named entity recognition and information extraction.
Perso-Arabic Script (Urdu)
ا، ب، پ، ت، ٹ، ث، ج، چ، ح، خ، د، ڈ، ذ، ر، ڑ، ز، ژ، س، ش، ص، ض، ط، ظ، ع، غ، ف، ق، ک، گ، ل، م، ن، ں، و، ہ، ھ، ء، ی، ے
BioUNER - Biomedical Urdu Named Entity Recognition/
│
└── BioUNER-Dataset
| Field | Details |
|---|---|
| Dataset Name | Biomedical Urdu Named Entity Recognition Dataset (BioUNER) |
| Language | Urdu (اردو) |
| Language Family | Indo-European — Indo-Aryan Branch |
| ISO 639-1 / 639-3 | ur / urd |
| Script | Perso-Arabic Script (Urdu, Unicode) |
| Annotation Tool | Doccano |
| Encoding | UTF-8 |
## Sample Text
پوری,O
کرنے,O
کے,O
لئے,O
ھدائیات۔,O
یہ,O
سیکھئے,O
کہ,O
آئرن,O
کی,O
کمی,O
کہتے,O
ھیں,O
اور,O
آپ,O
کو,O
اپنے,O
بچے,O
کی,O
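
The sample above suggests one `token,label` pair per line. As a minimal sketch of reading that layout, the snippet below parses such lines into pairs and counts the labels. It assumes there is no header row and no CSV quoting, neither of which the sample confirms. The function name `parse_token_labels` and the inline `SAMPLE` string are illustrative only. In practice the text would be read from the dataset's CSV file with UTF-8 encoding.

```python
from collections import Counter

# A few lines copied from the sample above; in practice, read the CSV file instead.
SAMPLE = "آئرن,O\nکی,O\nکمی,O\n"


def parse_token_labels(text):
    """Parse 'token,label' lines into a list of (token, label) tuples.

    Assumption: one pair per line, no header row, no CSV quoting.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the LAST comma so a token that itself contains a comma survives.
        token, sep, label = line.rpartition(",")
        if not sep:
            continue  # no comma at all: not a token,label line
        pairs.append((token, label))
    return pairs


pairs = parse_token_labels(SAMPLE)
label_counts = Counter(label for _, label in pairs)
print(len(pairs), dict(label_counts))  # prints: 3 {'O': 3}
```

Splitting on the last comma rather than the first keeps a token intact if it happens to contain a comma. If the real file turns out to use quoted fields, Python's `csv` module would be the safer choice.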