Śmigiel - machine-generated text detection

Fields

Column	Type	Values	Description
`text`	`string`	—	The text fragment to classify. Prefix-stripped, whitespace-normalized, and truncated.
`model`	`string`	`human`, `bielik-sm`, `bielik-md`, `mistral-sm`, `mistral-md`, `plum`, `gemma`, `llama-sm`, `llama-lg`	Source of the text — either `human` or the name of the LLM that produced it.
`strategy`	`string`	`human`, `greedy`, `sampling`, `beam_search`, `contrastive`, `dbs`, `llama_plum_sampling`	Decoding strategy used during generation. Set to `human` for human-written texts.
`key`	`ClassLabel`	`0` = `human`, `1` = `machine`	Binary classification label.

For fuller model details, refer to the Models section. Strategy parameters are explained in the Strategies section.
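The constraints between the fields can be checked mechanically. The following is a minimal sketch, assuming each row is available as a plain Python dict with the four columns listed above and that `key` is stored as an integer; the `check_row` helper and the sample rows are hypothetical and not part of the dataset tooling.

```python
# Allowed values, copied from the Fields table above.
MODELS = {"human", "bielik-sm", "bielik-md", "mistral-sm", "mistral-md",
          "plum", "gemma", "llama-sm", "llama-lg"}
STRATEGIES = {"human", "greedy", "sampling", "beam_search", "contrastive",
              "dbs", "llama_plum_sampling"}


def check_row(row: dict) -> bool:
    """Return True if a row is consistent with the field descriptions."""
    if row["model"] not in MODELS or row["strategy"] not in STRATEGIES:
        return False
    if row["model"] == "human":
        # Human texts carry strategy "human" and label 0.
        return row["strategy"] == "human" and row["key"] == 0
    # Machine texts carry a real decoding strategy and label 1.
    return row["strategy"] != "human" and row["key"] == 1


# Hypothetical rows, for illustration only.
rows = [
    {"text": "...", "model": "human", "strategy": "human", "key": 0},
    {"text": "...", "model": "gemma", "strategy": "greedy", "key": 1},
    {"text": "...", "model": "plum", "strategy": "human", "key": 1},  # inconsistent
]
print([check_row(r) for r in rows])  # prints [True, True, False]
```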

Splits

The dataset is divided into three splits: train, test_a and test_b.

Split	Size	Description
`train`	35,763	Human-written texts from 4 domains (Wikipedia, literature, social media, reviews) paired with completions from 7 models (3 small, 3 medium, 1 large).
`test_A`	10,343	Interleave of: ½ alpha test split + ⅓ beta (Llama 3.3 70B only) + ⅓ gamma (8 models, adds news & government domains). Released with labels during the competition.
`test_B`	18,432	Interleave of: ½ alpha test split + ⅔ beta + ⅔ gamma. Larger blind test set covering same domain/model mix as test_A. Labels released after the competition.

Curation Rationale

The dataset was originally compiled for Shared Task 1 at PolEval 2025 to:

provide training data for the SUPERVISED subtask;
provide two sets of test data for benchmarking submissions.

Source Data

The original, human-written texts come from 12 distinct data sources. Collectively, they cover a wide range of styles, lengths, and writing purposes.

Internal name	Domain	Description	Source
`wiki`	wiki	Polish Wikipedia	chrisociepa/wikipedia-pl-20230401
`plsc`	literature	Polish scientific article abstracts	rafalposwiata/plsc
`coursebooks`	literature	Polish open coursebooks	rafalposwiata/open-coursebooks-pl
`classics`	literature	Polish classic literature corpus	dmitriilebedev/polish-corpus (Kaggle)
`twitter`	social	Polish tweets (TwitterEmo)	clarin-pl/twitteremo
`wykop`	social	Polish social media posts (BAN-PL, non-offensive only)	ZILiAT-NASK/BAN-PL
`polemo_hotels`	reviews	Hotel reviews (PolEmo 2.0)	clarin-pl/polemo2-official
`polemo_medicine`	reviews	Medical reviews (PolEmo 2.0)	clarin-pl/polemo2-official
`polemo_products`	reviews	Product reviews (PolEmo 2.0)	clarin-pl/polemo2-official
`polemo_courses`	reviews	Course reviews (PolEmo 2.0)	clarin-pl/polemo2-official
`allegro`	reviews	Allegro marketplace reviews	PL-MTEB/allegro-reviews
`filmweb`	reviews	Polish movie reviews (FilmwebPlus)	narolski/filmwebplus
`pmrd`	reviews	Polish Movie Reviews Dataset	kamilsan/polish-movie-reviews-dataset
`wikinews`	news	Polish Wikinews articles (custom scrape)	pl.wikinews.org
`gov`	government	Polish parliamentary debates — Sejm + Senat (ParlaMint 5.0)	ParlaMint-PL, CLARIN.SI

Domains

To help balance the dataset in terms of linguistic features, the original sources were grouped into 6 genres, or "domains". The balancing is two-fold. Inward, the origin of texts within each domain is balanced as far as possible. Outward, the main part of the data (train and the alpha test subset) consists of equal shares of 4 domains. The two remaining domains, news articles and parliamentary hearings, were introduced only in the gamma test subset, which is meant to test robustness on unseen data.

Models

To obtain varied machine-generated text (MGT), we used models from different families and of varying sizes.

Moniker	Size	Full name	HuggingFace
`llama-sm`	small	Llama 3.1 8B Instruct	meta-llama/Llama-3.1-8B-Instruct
`bielik-sm`	small	Bielik 7B Instruct v0.1	speakleash/Bielik-7B-Instruct-v0.1
`mistral-sm`	small	Mistral 7B Instruct v0.3	mistralai/Mistral-7B-Instruct-v0.3
`bielik-md`	medium	Bielik 11B v2.3 Instruct	speakleash/Bielik-11B-v2.3-Instruct
`mistral-md`	medium	Mistral Nemo Instruct 2407	mistralai/Mistral-Nemo-Instruct-2407
`plum`	medium	PLLuM 12B nc chat	CYFRAGOVPL/PLLuM-12B-nc-chat
`gemma`	large	Gemma 3 27B Instruct	google/gemma-3-27b-it
`llama-lg`	large	Llama 3.3 70B Instruct	meta-llama/Llama-3.3-70B-Instruct

Strategies

We increase the variety of the generated data by applying different decoding strategies. These strategies determine how next-token candidates, or sequences of them, are ultimately selected by the model. The table below summarizes them, together with the parameters each one translates to in the model inference call.

Strategy	Full name	Parameters	Reference
`greedy`	Greedy decoding	`do_sample=False`	HF docs
`sampling`	Multinomial sampling	`do_sample=True, num_beams=1`	HF docs
`beam_search`	Beam search	`num_beams=2`	HF docs
`contrastive`	Contrastive search	`penalty_alpha=0.6, top_k=4`	Su et al., 2022
`dbs`	Diverse beam search	`num_beams=6, num_beam_groups=3, diversity_penalty=1.0`	Vijayakumar et al., 2018
`llama_plum_sampling`	Temperature sampling	`do_sample=True, temperature=0.6, top_p=0.9`	— (custom config for `llama-lg` and `plum`)
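The table above can be expressed as a lookup from strategy name to keyword arguments. This is a minimal sketch, assuming the parameters are passed to a Hugging Face `generate` call; the `GENERATION_KWARGS` dict and the `kwargs_for` helper are hypothetical names, and depending on the `transformers` release some strategies may need additional setup.

```python
# Keyword arguments per strategy, copied from the Parameters column above.
GENERATION_KWARGS = {
    "greedy": {"do_sample": False},
    "sampling": {"do_sample": True, "num_beams": 1},
    "beam_search": {"num_beams": 2},
    "contrastive": {"penalty_alpha": 0.6, "top_k": 4},
    "dbs": {"num_beams": 6, "num_beam_groups": 3, "diversity_penalty": 1.0},
    "llama_plum_sampling": {"do_sample": True, "temperature": 0.6, "top_p": 0.9},
}


def kwargs_for(strategy: str) -> dict:
    """Return a copy of the decoding kwargs for a strategy name."""
    if strategy == "human":
        raise ValueError("'human' marks human-written text; nothing to generate")
    return dict(GENERATION_KWARGS[strategy])


print(kwargs_for("dbs"))
# Assumed usage with an already loaded model and tokenized inputs:
#   model.generate(**inputs, **kwargs_for("dbs"))
# Length limits and other settings are not given in this card.
```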

Composition

Test A and Test B are both composed of three distinct subsets of data: alpha, beta, and gamma. The two tests differ in the proportion of each subset they receive.

Subset	Data	Models	Examples	Share assigned to `test_a`	Share assigned to `test_b`
alpha	old (4 domains)	all 7 base models + human	4,505	½	½
beta	old (4 domains)	`llama-lg` only + human	4,356	⅓	⅔
gamma	new (news + gov)	all 8 models + human	19,914	⅓	⅔

alpha is simply the test split of the base postprocessing run.
beta is the result of processing human texts from the dev split with a single model (`llama-lg`, Llama 3.3 70B).
gamma uses all the available models to process data from unseen datasets (this data had not been published before).

The subsets contribute to the two resulting test groups through a round-robin assignment method. While the contribution of alpha is equal in both, the larger Test B gets more content from beta and gamma.
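The arithmetic behind the split sizes can be reproduced with a short sketch. It assumes a simple dealing pattern per subset (alternating A, B for alpha; repeating A, B, B for beta and gamma); the exact assignment order used in the original pipeline is not stated in this card, so `PATTERNS` and `deal` are illustrative only.

```python
from itertools import cycle

# Subset sizes from the Composition table above.
SUBSET_SIZES = {"alpha": 4505, "beta": 4356, "gamma": 19914}

# Assumed dealing patterns, one slot per unit of share:
# alpha: 1/2 vs 1/2 -> A, B; beta and gamma: 1/3 vs 2/3 -> A, B, B.
PATTERNS = {"alpha": "AB", "beta": "ABB", "gamma": "ABB"}


def deal(sizes: dict, patterns: dict) -> dict:
    """Count how many examples each test set receives under round-robin dealing."""
    totals = {"A": 0, "B": 0}
    for name, size in sizes.items():
        targets = cycle(patterns[name])
        for _ in range(size):
            totals[next(targets)] += 1
    return totals


print(deal(SUBSET_SIZES, PATTERNS))  # prints {'A': 10343, 'B': 18432}
```

Under these assumed patterns the totals match the sizes listed in the Splits table: 10,343 for Test A and 18,432 for Test B.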

FAQ

Who are the source data producers?

The source data comes from the authors of open-source corpora as well as researchers at IPI PAN, to whom we are deeply indebted.

Personal and Sensitive Information

The usernames in texts coming from social media have been anonymized. To the best of our knowledge, the data does not contain any other personal or sensitive information.

Data Collection and Processing

The source data is processed to produce prompts that guide machine generation. The generations are then filtered and sampled. For details on these processes, please refer to our repo.

Citation

@article{strebeyko2025smigiel,
  title={Śmigiel Dataset: Laying Foundations for Investigating Machine-Generated Text Detection in Polish},
  author={Jakub Strebeyko and Alina Wróblewska and Piotr Przybyła},
  journal={LREC 2025},
  year={2025}
}

Designed by: Piotr Przybyła, PhD and Alina Wróblewska, PhD
Compiled and Curated by: JJ Strebeyko
Language(s) (NLP): Polish
License: Creative Commons Attribution 4.0
Repository: https://github.com/JStrebeyko/code-of-smigiel
Paper: PolEval 2025 Task 1 Śmigiel: Spotting Machine-Generated Text from LLMs for Polish
Resources: Zenodo

Acknowledgements

Dataset creation was supported by the Ramón y Cajal grant RYC2024-050327-I, funded by the Spanish State Research Agency (MICIU/AEI/10.13039/501100011033) and by the European Social Fund Plus (ESF+) of the European Union. We also gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2025/018019.