License:
CC-BY-SA-4.0
Steward: Community
Dataset ID: cmp6qb2al02inmp07dlajrjke
Task: NLG
Release Date: 5/15/2026
Format: CSV
Size: 17.13 MB
Consisting of 64 538 human-written and machine-generated texts in Polish from various domains, ŚMIGIEL is a comprehensive resource for training and benchmarking Machine-Generated Text (MGT) detection systems focused on the Polish language. The dataset was originally created for Shared Task 1 at PolEval 2025 (see http://poleval.pl/tasks/task1), organized by the Linguistic Engineering (LE) Group (learn more at https://zil.ipipan.waw.pl/), part of the Department of Artificial Intelligence at the Institute of Computer Science, Polish Academy of Sciences (IPI PAN, official site: https://ipipan.waw.pl/).
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Restrictions/Special Constraints
ŚMIGIEL is not for you if:
- you seek to train and validate systems that define the task not as classification but as a text-boundary problem. This definition often requires more nuanced solutions and more granular data (see [Named-entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)).
- you seek to use the dataset to enhance systems that pose as humans with the intention to deceive. This is considered a malicious use of the dataset.
Forbidden Usage
- It is forbidden to use this dataset to train chatbots or large language models
- You agree not to use the dataset for training or augmentation of systems meant to deceive humans
Intended Use
Direct Use
The dataset is useful whenever a use case:
- requires a large body of texts in Polish from various domains
- is concerned with comparing generations from different LLM families across varying decoding strategies
- needs a balanced representation of human-written and machine-generated texts to train and benchmark detection systems
| Column | Type | Values | Description |
|---|---|---|---|
| text | string | - | The text fragment to classify. Prefix-stripped, whitespace-normalized, and truncated. |
| model | string | human, bielik-sm, bielik-md, mistral-sm, mistral-md, plum, gemma, llama-sm, llama-lg | Source of the text: either human or the name of the LLM that produced it. |
| strategy | string | human, greedy, sampling, beam_search, contrastive, dbs, llama_plum_sampling | Decoding strategy used during generation. Set to human for human-written texts. |
| key | ClassLabel | 0 = human, 1 = machine | Binary classification label. |
For fuller model details, refer to the models section. Strategy parameters are explained in the strategies section.
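
The column descriptions above imply a few consistency rules that can be checked row by row. The following is a minimal sketch using only the Python standard library. It assumes that the CSV header uses the column names from the table and that `key` is stored as the strings `0` and `1`; the three sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Allowed values, copied from the column table above.
ALLOWED_MODELS = {
    "human", "bielik-sm", "bielik-md", "mistral-sm", "mistral-md",
    "plum", "gemma", "llama-sm", "llama-lg",
}
ALLOWED_STRATEGIES = {
    "human", "greedy", "sampling", "beam_search",
    "contrastive", "dbs", "llama_plum_sampling",
}


def row_problems(row):
    """Return a list of schema violations for one CSV row (empty if none)."""
    problems = []
    if row["model"] not in ALLOWED_MODELS:
        problems.append(f"unknown model: {row['model']}")
    if row["strategy"] not in ALLOWED_STRATEGIES:
        problems.append(f"unknown strategy: {row['strategy']}")
    if row["key"] not in {"0", "1"}:  # assumption: labels serialized as 0/1
        problems.append(f"unknown key: {row['key']}")
    # Inferred rule: human rows have model == strategy == "human" and key == 0.
    is_human = row["model"] == "human"
    if is_human != (row["strategy"] == "human") or is_human != (row["key"] == "0"):
        problems.append("model, strategy and key disagree on human vs machine")
    return problems


# Hypothetical sample rows; the third one is deliberately inconsistent.
SAMPLE_CSV = """text,model,strategy,key
"Przykładowy tekst napisany przez człowieka.",human,human,0
"Przykładowy tekst wygenerowany przez model.",bielik-sm,greedy,1
"Wiersz z niespójną etykietą.",gemma,sampling,0
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
report = {index: row_problems(row) for index, row in enumerate(reader)}
for index, problems in report.items():
    print(index, problems or "ok")
```

To run the same check on the real files, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for the downloaded CSV.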
The dataset is split between train, test_a and test_b.
| Split | Size | Description |
|---|---|---|
| train | 35,763 | Human-written texts from 4 domains (Wikipedia, literature, social media, reviews) paired with completions from 7 models (3 small, 3 medium, 1 large). |
| test_A | 10,343 | Interleave of: ½ alpha test split + ⅓ beta (Llama 3.3 70B only) + ⅓ gamma (8 models, adds news & government domains). Released with labels during the competition. |
| test_B | 18,432 | Interleave of: ½ alpha test split + ⅔ beta + ⅔ gamma. Larger blind test set covering same domain/model mix as test_A. Labels released after the competition. |
The dataset was originally compiled for Shared Task 1 at PolEval 2025 to:
- provide training data for the SUPERVISED subtask;
- provide two sets of testing data for benchmarking submissions.
The original, human-written texts come from 12 distinct data sources. Collectively, they cover a wide range of styles, lengths, and purposes of writing.
| Internal name | Domain | Description | Source |
|---|---|---|---|
| wiki | wiki | Polish Wikipedia | chrisociepa/wikipedia-pl-20230401 |
| plsc | literature | Polish scientific article abstracts | rafalposwiata/plsc |
| coursebooks | literature | Polish open coursebooks | rafalposwiata/open-coursebooks-pl |
| classics | literature | Polish classic literature corpus | dmitriilebedev/polish-corpus (Kaggle) |
| twitter | social | Polish tweets (TwitterEmo) | clarin-pl/twitteremo |
| wykop | social | Polish social media posts (BAN-PL, non-offensive only) | ZILiAT-NASK/BAN-PL |
| polemo_hotels | reviews | Hotel reviews (PolEmo 2.0) | clarin-pl/polemo2-official |
| polemo_medicine | reviews | Medical reviews (PolEmo 2.0) | clarin-pl/polemo2-official |
| polemo_products | reviews | Product reviews (PolEmo 2.0) | clarin-pl/polemo2-official |
| polemo_courses | reviews | Course reviews (PolEmo 2.0) | clarin-pl/polemo2-official |
| allegro | reviews | Allegro marketplace reviews | PL-MTEB/allegro-reviews |
| filmweb | reviews | Polish movie reviews (FilmwebPlus) | narolski/filmwebplus |
| pmrd | reviews | Polish Movie Reviews Dataset | kamilsan/polish-movie-reviews-dataset |
| wikinews | news | Polish Wikinews articles (custom scrape) | pl.wikinews.org |
| gov | government | Polish parliamentary debates: Sejm + Senat (ParlaMint 5.0) | ParlaMint-PL, CLARIN.SI |
To help balance the dataset in terms of linguistic features, the original sources were grouped into 6 genres, or "domains". This balancing is two-fold: inward, as the origin of texts within a domain is balanced as far as was possible, and outward, as the main part of the data (train and the alpha test subset) consists of equal shares of 4 domains. The two remaining ones, news articles and parliamentary hearings, were introduced as part of the robust training subset.
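
The grouping can be restated as data, which also makes the counts easy to verify. The sketch below copies the internal names, domains, and sources from the table above and counts them; the variable names are illustrative and do not come from the project code.

```python
from collections import Counter

# (internal name, domain, source) triples, copied from the source table.
SOURCES = [
    ("wiki", "wiki", "chrisociepa/wikipedia-pl-20230401"),
    ("plsc", "literature", "rafalposwiata/plsc"),
    ("coursebooks", "literature", "rafalposwiata/open-coursebooks-pl"),
    ("classics", "literature", "dmitriilebedev/polish-corpus"),
    ("twitter", "social", "clarin-pl/twitteremo"),
    ("wykop", "social", "ZILiAT-NASK/BAN-PL"),
    ("polemo_hotels", "reviews", "clarin-pl/polemo2-official"),
    ("polemo_medicine", "reviews", "clarin-pl/polemo2-official"),
    ("polemo_products", "reviews", "clarin-pl/polemo2-official"),
    ("polemo_courses", "reviews", "clarin-pl/polemo2-official"),
    ("allegro", "reviews", "PL-MTEB/allegro-reviews"),
    ("filmweb", "reviews", "narolski/filmwebplus"),
    ("pmrd", "reviews", "kamilsan/polish-movie-reviews-dataset"),
    ("wikinews", "news", "pl.wikinews.org"),
    ("gov", "government", "ParlaMint-PL, CLARIN.SI"),
]

# Number of internal subsets per domain.
subsets_per_domain = Counter(domain for _, domain, _ in SOURCES)
# The four PolEmo 2.0 subsets share one upstream source, so they count once.
distinct_sources = {source for _, _, source in SOURCES}

print(len(SOURCES), "internal subsets")
print(len(subsets_per_domain), "domains:", dict(subsets_per_domain))
print(len(distinct_sources), "distinct sources")
```

The 15 internal subsets collapse to 12 distinct sources because the four PolEmo 2.0 subsets come from the same upstream corpus.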
To provide versatile MGT, we used models from different families and of varying sizes.
| Moniker | Size | Full name | HuggingFace |
|---|---|---|---|
| llama-sm | small | Llama 3.1 8B Instruct | meta-llama/Llama-3.1-8B-Instruct |
| bielik-sm | small | Bielik 7B Instruct v0.1 | speakleash/Bielik-7B-Instruct-v0.1 |
| mistral-sm | small | Mistral 7B Instruct v0.3 | mistralai/Mistral-7B-Instruct-v0.3 |
| bielik-md | medium | Bielik 11B v2.3 Instruct | speakleash/Bielik-11B-v2.3-Instruct |
| mistral-md | medium | Mistral Nemo Instruct 2407 | mistralai/Mistral-Nemo-Instruct-2407 |
| plum | medium | PLLuM 12B nc chat | CYFRAGOVPL/PLLuM-12B-nc-chat |
| gemma | large | Gemma 3 27B Instruct | google/gemma-3-27b-it |
| llama-lg | large | Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct |
We foster the versatility of the generated data by applying different decoding strategies. These strategies determine how next-token candidates, or sequences thereof, are ultimately selected by the model. Below is a rundown of the strategies, together with how they translate into parameters of the model inference call.
| Strategy | Full name | Parameters | Reference |
|---|---|---|---|
| greedy | Greedy decoding | do_sample=False | HF docs |
| sampling | Multinomial sampling | do_sample=True, num_beams=1 | HF docs |
| beam_search | Beam search | num_beams=2 | HF docs |
| contrastive | Contrastive search | penalty_alpha=0.6, top_k=4 | Su et al., 2022 |
| dbs | Diverse beam search | num_beams=6, num_beam_groups=3, diversity_penalty=1.0 | Vijayakumar et al., 2018 |
| llama_plum_sampling | Temperature sampling | do_sample=True, temperature=0.6, top_p=0.9 | none (custom config for llama-lg and plum) |
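
The parameters in the table map directly onto keyword arguments of the Hugging Face `generate` method. The sketch below collects them in a lookup table; it is an illustration and not the project's actual generation code. The helper name `generation_kwargs` and the `max_new_tokens=128` value are assumptions, since the card does not state output length limits, and support for some strategies may depend on the installed `transformers` version.

```python
# Keyword arguments per strategy, copied from the table above.
GENERATION_KWARGS = {
    "greedy": {"do_sample": False},
    "sampling": {"do_sample": True, "num_beams": 1},
    "beam_search": {"num_beams": 2},
    "contrastive": {"penalty_alpha": 0.6, "top_k": 4},
    "dbs": {"num_beams": 6, "num_beam_groups": 3, "diversity_penalty": 1.0},
    "llama_plum_sampling": {"do_sample": True, "temperature": 0.6, "top_p": 0.9},
}


def generation_kwargs(strategy, **extra):
    """Return decoding kwargs for a strategy value found in the dataset.

    `extra` holds caller-supplied settings (for example a length limit)
    that are merged on top of the strategy parameters.
    """
    if strategy == "human":
        raise ValueError("'human' marks human-written text, not a decoding strategy")
    if strategy not in GENERATION_KWARGS:
        raise KeyError(f"unknown strategy: {strategy}")
    return {**GENERATION_KWARGS[strategy], **extra}


# Hypothetical usage: the length limit is an assumed value, not from the card.
kwargs = generation_kwargs("dbs", max_new_tokens=128)
print(kwargs)
# With a loaded model and tokenized prompt this would be passed as:
#   output_ids = model.generate(**inputs, **kwargs)
```

Keeping the parameters in one dictionary means the `strategy` column of a row can be used to look up how that text was decoded.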
Test A and Test B are both composed of three distinct subsets of data: alpha, beta, and gamma. The two tests differ in the proportions of sampled data.
| Subset | Data | Models | Examples | Share in test_a | Share in test_b |
|---|---|---|---|---|---|
| alpha | old (4 domains) | all 7 base models + human | 4 505 | ½ | ½ |
| beta | old (4 domains) | llama-lg only + human | 4 356 | ⅓ | ⅔ |
| gamma | new (news + gov) | all 8 models + human | 19 914 | ⅓ | ⅔ |
- alpha is simply the test split of the base postprocessing run
- beta is the result of processing human texts from the dev split with a single model (llama-lg, Llama 3.3 70B)
- gamma uses all the available models to process data from unseen datasets (the data was not published before)
The subsets contribute to the two resulting test groups through a round-robin assignment method. While the contribution of alpha is equal in both, the larger Test B gets more content from beta and gamma.
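
The subset sizes and shares can be checked against the reported split sizes. The sketch below assumes a simple round-robin scheme in which each subset cycles through a fixed pattern of targets (1:1 for alpha, 1:2 for beta and gamma); the exact procedure used by the authors lives in their repository and may differ. The names `PATTERNS` and `round_robin_sizes` are illustrative.

```python
from itertools import cycle

# Number of examples per subset, from the table above.
SUBSET_SIZES = {"alpha": 4505, "beta": 4356, "gamma": 19914}

# Assumed assignment cycle per subset: alpha alternates between the two
# tests, beta and gamma send one item to test_a for every two to test_b.
PATTERNS = {
    "alpha": ("test_a", "test_b"),
    "beta": ("test_a", "test_b", "test_b"),
    "gamma": ("test_a", "test_b", "test_b"),
}


def round_robin_sizes(sizes, patterns):
    """Count how many items each test set receives under round-robin."""
    counts = {"test_a": 0, "test_b": 0}
    for subset, n_items in sizes.items():
        # zip stops after n_items, so the infinite cycle is safe here.
        for _, target in zip(range(n_items), cycle(patterns[subset])):
            counts[target] += 1
    return counts


counts = round_robin_sizes(SUBSET_SIZES, PATTERNS)
print(counts)  # {'test_a': 10343, 'test_b': 18432}
```

Under this assumed scheme the counts match the split sizes reported above (10,343 and 18,432), and together with the 35,763 training examples they add up to the 64 538 texts in the dataset.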
The source data comes from the authors of open-source corpora as well as researchers at IPI PAN, to whom we are deeply indebted.
The usernames in texts coming from social media have been anonymized. To the best of our knowledge, the data does not contain any other personal or sensitive information.
The source data is processed to produce prompts that guide machine generation. The generations are then filtered and sampled. For details on these processes, please refer to our repo.
@article{strebeyko2025smigiel,
title={Śmigiel Dataset: Laying Foundations for Investigating Machine-Generated Text Detection in Polish},
author={Jakub Strebeyko and Alina Wróblewska and Piotr Przybyła},
journal={LREC 2025},
year={2025}
}
Designed by: Piotr Przybyła, PhD and Alina Wróblewska, PhD
Compiled and Curated by: JJ Strebeyko
Language(s) (NLP): Polish
License: Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
Repository: https://github.com/JStrebeyko/code-of-smigiel
Paper: PolEval 2025 Task 1 Śmigiel: Spotting Machine-Generated Text from LLMs for Polish
Resources: Zenodo
Dataset creation was supported by the Ramón y Cajal grant RYC2024-050327-I, funded by the Spanish State Research Agency (MICIU/AEI/10.13039/501100011033) and by the European Social Fund Plus (ESF+) of the European Union. We also gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019.