License:
CC-BY-SA-4.0
Steward:
MDC Curators
Task: NLP
Release Date: 3/24/2026
Format: TSV
Size: 57.35 KB
This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan. The sentences are provided to support the development of offensive language detection in Catalan.
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
These sentences must not be used to generate offensive content.
Forbidden Usage
It is forbidden to use these sentences to generate offensive content.
Intended Use
The sentences are provided to support the development of offensive language detection in Catalan.
This dataset consists of sentences tagged as offensive-language in the version 25.0 release of Mozilla Common Voice in Catalan.
The dataset contains 770 lines comprising 8,219 tokens.
Many of the sentences appear to have been generated from templates involving a place name. Many also contain grammatical errors (for example, "Els negras"), as is typical of the genre. The majority express bigotry towards Muslims and Black people, particularly immigrants, but some express bigotry towards immigrants from Latin America.
These sentences were uploaded via the "single sentence" upload facility in Mozilla Common Voice and are licensed CC-0.
The dataset contains a single file, offensive-language.tsv, which has four columns:
sentence_id: The hash of the sentence
sentence: The sentence text
locale: The locale (in this case ca -- Catalan)
category: The category (in this case offensive-language)
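
As an illustration of the file layout described above, the following is a minimal sketch of reading the file with Python's standard csv module. It assumes the file is UTF-8 encoded and begins with a header row naming the four columns; neither is stated in this description. The function names load_sentences and count_whitespace_tokens are hypothetical helpers introduced only for this example.

```python
import csv
from pathlib import Path

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["sentence_id", "sentence", "locale", "category"]


def load_sentences(path):
    """Read the TSV into a list of dicts, one per sentence.

    Assumption: the first row is a header with the four column names.
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quotation marks inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        return list(reader)


def count_whitespace_tokens(rows):
    """Count tokens by splitting each sentence on whitespace.

    Assumption: the tokenizer behind the published token count is not
    specified, so this figure may differ from it.
    """
    return sum(len(row["sentence"].split()) for row in rows)
```

Calling load_sentences("offensive-language.tsv") would return one dictionary per data row, which can then be passed to a classifier training pipeline or filtered on the category column. Because the tokenization used for the 8,219 figure is not documented here, count_whitespace_tokens should be treated as a rough sanity check rather than a way to reproduce that number.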