Task: MT
Release Date: 6/30/2026
Format: JSONL
Size: 201.48 KB
An evaluation benchmark that tests whether machine translation keeps inline localization assets intact: XLIFF and HTML markup, printf and brace placeholders, template variables, ICU MessageFormat, inline Markdown, and do-not-translate spans. English UI strings are paired with human translations from permissively licensed, key-aligned localization catalogs (Apache OpenOffice, Chromium, AOSP, DSpace, Godot, Flutter Gallery): 599 source strings x 9 target languages (ca/es/fr/it/pt-PT/de/nl/pl/ru) = 5,391 records with the same class profile in every language. Scoring is fully automatic: for each record, deterministic checks compare the assets in the system output against the source and report which of nine error categories occur; a record passes when none does, and a system's score is its pass rate, reported overall and per category. Reference labels are generated automatically from the source, not human-annotated. No training set is shipped; the dev split is a small labelled set for optional few-shot prompting or sanity checks. The open layer contains the dev split (with references), the test inputs (references withheld), and a contrastive pack of (correct, damaged) minimal pairs; the hidden split is not distributed. The scoring code is open-source at https://github.com/Prompsit/prompsit-mdc - see DATASHEET.md inside the archive for construction method, checks, and limits.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Test-split gold withheld; hidden split not distributed.
Forbidden Usage
No additional legal restriction beyond CC-BY-4.0. For benchmark integrity, disclose any training or tuning on the open test inputs and do not report such runs as uncontaminated official benchmark results.
Ethical Review
Public-domain / permissively-licensed UI strings only; no personal data.
Intended Use
Evaluation of machine-translation systems for inline asset integrity across nine languages. No training set is shipped.
Version 1.0 | Schema 0.1 | Scoring rule: every inline asset in the source must appear in the translation byte-for-byte intact
English UI strings carry inline localization assets: XLIFF and HTML markup, printf and brace placeholders, template variables, inline ICU MessageFormat, inline Markdown, and do-not-translate spans. This dataset tests whether machine translation keeps those assets intact instead of dropping or corrupting them. Each source string is paired with human translations into nine languages, drawn from permissively licensed, key-aligned localization catalogs.
Reference labels are generated automatically and deterministically from the source by rule-based parsers; they are not human annotations. Every released reference passes the scoring script (5,391/5,391).
1. Download the open layer from this page and unpack it: data/dev.jsonl (inputs plus reference translations), data/test.input.jsonl (inputs only), data/contrastive.dev.jsonl.
2. Translate the source strings with the MT system you want to evaluate - any system works; no special integration is required.
3. Score the outputs with the open-source scoring script - score_item(...) in https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d1-inline-asset-integrity/build (validators.py). Each record passes or fails per error category; your score is the pass rate.
4. To verify your harness first, score the dev references themselves: they must pass 100%. For an official, comparable score on the withheld test and hidden splits, contact info@prompsit.com.
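The harness self-check in the last step can be sketched as follows. This is a minimal sketch, not the repository's own harness: the field names item_id and target_text, and the Scorer adapter shape, are assumptions made for illustration. Wrap the real score_item(...) in a small function of this shape after checking its actual signature in validators.py.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical adapter type: (record, translation) -> list of triggered
# error categories. The real score_item(...) may have a different signature.
Scorer = Callable[[dict, str], list]


def read_jsonl(path: Path) -> list:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def failing_references(records: list, scorer: Scorer,
                       text_field: str = "target_text") -> list:
    """Return the item_ids whose own reference triggers any error category."""
    # "item_id" and the default text_field are assumed field names.
    return [rec["item_id"] for rec in records if scorer(rec, rec[text_field])]
```

With a working adapter, failing_references(read_jsonl(Path("data/dev.jsonl")), scorer) should return an empty list, since the dev references must pass 100%.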
The task: translate each UI string from English into the target language while keeping every inline asset structurally intact.
Scoring is fully automatic; no human judges are involved. For every record, a deterministic scoring script (open-source: https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d1-inline-asset-integrity/build) inspects the system output and reports which error categories occur (see the Error categories table below). A record passes if no error category occurs. The score of a system is its pass rate: the fraction of records that pass, reported overall and per error category. The checks compare the assets in the output against the source string, so no reference translation is needed to score a system.
The scoring entry point is score_item(...) in validators.py there; the README and AGENTS.md at the repository root
walk through scoring your own outputs.
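The aggregation described above can be sketched as follows, assuming the per-record results have already been collected as lists of triggered category names. The function name summarize and the choice of denominator for the per-category figure (all scored records) are assumptions; the official report may instead use only the records tagged with each category.

```python
from collections import Counter


def summarize(results: list) -> dict:
    """results[i] is the list of error categories triggered by record i."""
    total = len(results)
    if total == 0:
        raise ValueError("no records to summarize")
    # A record passes only when it triggers no category at all.
    passed = sum(1 for cats in results if not cats)
    # Count each category at most once per record.
    hits = Counter(cat for cats in results for cat in set(cats))
    return {
        "pass_rate": passed / total,
        # Assumed denominator: all scored records.
        "error_rate_by_category": {c: n / total for c, n in sorted(hits.items())},
    }
```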
The seven automated checks are: asset inventory, placeholder_syntax,
nesting, icu_syntax, order, attributes, and verbatim (for
do-not-translate spans).
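To give a feel for what an asset inventory check does, here is a simplified sketch that compares the multiset of assets in the source with the one in the output. It is not the code in validators.py: the pattern ASSET_RE covers only XML-like tags, a few printf placeholders, and {{template}} variables, and a damaged asset is reported here as missing, whereas the real script has a separate corrupted_syntax category.

```python
import re
from collections import Counter

# Simplified, assumed pattern: XML-like tags, printf placeholders such as
# %s, %d, %1$s, and {{template}} variables. The real parsers cover more.
ASSET_RE = re.compile(r"</?[A-Za-z][^<>]*>|%(?:\d+\$)?[sd]|\{\{[^{}]+\}\}")


def inventory_errors(source: str, output: str) -> list:
    """Compare asset multisets and report missing_asset / extra_asset."""
    src = Counter(ASSET_RE.findall(source))
    out = Counter(ASSET_RE.findall(output))
    errors = []
    if src - out:  # something in the source is absent from the output
        errors.append("missing_asset")
    if out - src:  # the output has something the source lacks
        errors.append("extra_asset")
    return errors
```

Using multisets rather than sets means a placeholder that appears twice in the source must also appear twice in the output.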
Domain: software localization and machine-translation quality evaluation for UI and resource strings with inline assets.
Size: 5,391 records; the open archive contains the dev split (with references), the test inputs, the contrastive pairs, README, this datasheet, third-party notices, manifest, and Croissant metadata.
Structure: JSONL records with source and target text, target language, asset
inventory, asset positions (ref_tag_positions, located by verbatim search),
expected invariants, error-category tags, split, and provenance.
License: CC-BY-4.0 open layer; upstream attributions in
THIRD_PARTY_NOTICES.md.
Intended use: evaluation of MT systems for inline asset preservation across nine target languages. No training set is shipped.
Ethical review: permissively licensed or public UI strings only; no personal data.
Contact: Prompsit Language Engineering, info@prompsit.com.
Dataset on MDC (download the open layer): https://mozilladatacollective.com/datasets/cmr0mng9z01bsmk07cuqltz81
The test references and the hidden split are withheld, so a score on those splits reflects performance on unseen inputs rather than answers a system could have memorised. For an independent evaluation of an MT or LLM system on the withheld splits, contact Prompsit at info@prompsit.com.
| Split | File | Records | What it contains |
|---|---|---|---|
| Open dev | data/dev.jsonl | 639 | inputs plus the reference translation and labels |
| Test inputs | data/test.input.jsonl | 3,717 | inputs only; references withheld |
| Test references | data/test.ref.jsonl | 3,717 | withheld, retained by Prompsit |
| Hidden | - | 1,035 | never distributed |
| Contrastive | data/contrastive.dev.jsonl | 1,279 | (correct, damaged) minimal pairs from the dev split; the scoring script tells the two members of each pair apart |
599 sources x 9 languages = 5,391 records. Split of about 10% dev / 70% test / 20% hidden (sources: 71 / 413 / 115, that is 11.9% / 68.9% / 19.2%), stratified by asset-class profile and partitioned by item_id, so a source and its nine translations never cross splits.
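A stratified split that is assigned per source, so that every translation inherits the split of its source, might look like the sketch below. The function name, the seed, the rounding rule, and the representation of a profile as a plain string are all assumptions; the released splits come from the project's own build pipeline, not from this code.

```python
import random
from collections import defaultdict


def assign_splits(profiles: dict, seed: int = 0) -> dict:
    """Map each source item_id to dev/test/hidden, stratified by profile."""
    by_profile = defaultdict(list)
    for item_id, profile in sorted(profiles.items()):
        by_profile[profile].append(item_id)
    rng = random.Random(seed)  # seeded, so the assignment is reproducible
    splits = {}
    for profile in sorted(by_profile):
        ids = by_profile[profile]
        rng.shuffle(ids)
        n_dev = round(0.1 * len(ids))
        n_test = round(0.7 * len(ids))
        for i, item_id in enumerate(ids):
            if i < n_dev:
                splits[item_id] = "dev"
            elif i < n_dev + n_test:
                splits[item_id] = "test"
            else:
                splits[item_id] = "hidden"
    return splits
```

Because the split is keyed on the source item_id, looking it up for any of the nine translations of that source always yields the same split.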
No training set is shipped. The dev split is a small labelled set for optional few-shot prompting or sanity checks; it is not required to run the benchmark. The test references and the entire hidden split are withheld by Prompsit so that systems cannot be tuned to them.
Languages: English (en) into ca, es, fr, it, pt-PT, de, nl, pl, ru. Every source is present in all nine languages with the same asset-class profile, so per-language scores are directly comparable.
Every record is tagged with the error categories it can expose; the scoring script detects each category with a dedicated automated check. The scoring script does not assign per-category severity: a record that triggers any category fails.
| Error category | What it means | Records |
|---|---|---|
| missing_asset | an inline asset from the source is absent from the output | 4,356 |
| extra_asset | the output contains an asset the source does not have | 4,356 |
| corrupted_syntax | an asset survives but its markup or placeholder syntax is damaged | 4,356 |
| invalid_nesting | paired tags overlap or close in the wrong order | 2,772 |
| moved_paired_tag | a paired tag moved so it no longer wraps the content it wrapped in the source | 2,772 |
| wrong_order | assets appear in an order that breaks a required ordering (for example positional placeholders) | 2,565 |
| lost_attribute | a tag survives but loses an attribute it had in the source (href, id, ...) | 2,304 |
| broken_icu | an ICU MessageFormat structure is damaged (missing branch, broken braces) | 1,467 |
| dnt_violation | a do-not-translate span was translated or altered | 1,260 |
Every error category has at least 400 records (our minimum for a reliable per-category estimate). Seven asset classes are covered, each with at least 400 records in the scored set: xliff (1,548), software_placeholder (1,503), icu_messageformat (1,467), markdown_inline (1,431), template_variable (1,350), html_tag (1,323), do_not_translate (1,260). The dev split contains every class.
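As an illustration of one category from the table, invalid_nesting can be detected with a stack over paired tags. This is a sketch under assumptions, not the check in validators.py: the pattern TAG_RE is a simplification, and this version also flags tags that are never closed, which the real script may report under a different category.

```python
import re

# Assumed, simplified tag pattern: optional "/", a tag name (letters, digits,
# ":", ".", "-"), optional attributes, optional self-closing "/".
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def has_invalid_nesting(text: str) -> bool:
    """True if paired tags overlap, close in the wrong order, or stay open."""
    stack = []
    for match in TAG_RE.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <br/> opens and closes itself
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return True  # closing tag does not match the innermost open tag
    return bool(stack)
```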
Real records from the open dev split, truncated for width. Angle brackets in markup are shown as ⟨ ⟩ because this platform strips raw HTML-like tags; the data files contain the ordinary characters.
| item_id | target | source text | target text | asset classes | error categories |
|---|---|---|---|---|---|
| d1-001527 | ca | {HOURS, plural, =1 {This device will be saved for 1 hour and you can connect without a code next time...}} | {HOURS,plural, =1{Aquest dispositiu es desarà durant 1 hora i et podràs connectar sense un codi la propera vegada...}} | icu_messageformat | broken_icu |
| d1-000934 | ca | ⟨xliff:g id="app_name" example="Gmail"⟩%1$s⟨/xliff:g⟩ isn't available right now. This is managed by... | ⟨xliff:g id="APP_NAME_0"⟩%1$s⟨/xliff:g⟩ no està disponible en aquests moments. Aquesta opció es gestiona a... | xliff, software_placeholder | missing_asset, corrupted_syntax |
| d1-000048 | ca | FileName: Name of the file, including the path, that you want to test attributes of. If you do not enter a path, ⟨emph⟩SetAttr⟨/emph⟩... | FileName: Nom del fitxer, inclòs el camí, del qual voleu provar els atributs. Si no introduïu un camí, ⟨emph⟩SetAttr⟨/emph⟩... | html_tag, do_not_translate | dnt_violation, invalid_nesting |
| d1-000418 | ca | This ⟨emph⟩Fontwork⟨/emph⟩ dialog is only available for Fontwork in old Writer text documents that were created prior to %PRODUCTNAME... | Aquest diàleg ⟨emph⟩Fontwork⟨/emph⟩ només està disponible per al Fontwork de documents de text del Writer creats amb una versió anterior a %PRODUCTNAME... | html_tag, template_variable | missing_asset, moved_paired_tag |
| d1-000161 | ca | ⟨emph⟩Reference⟨/emph⟩ (list of options) is the position of the cell to be examined... | ⟨emph⟩Referència⟨/emph⟩ (llista d'opcions) és la posició de la cel·la que s'ha d'examinar... | html_tag, markdown_inline | wrong_order, extra_asset |
Key-aligned localization catalogs (msgid / resource name / JSON key); values are human translations.
| Corpus | License |
|---|---|
| Apache OpenOffice (openoffice-translation) | Apache-2.0 |
| Chromium (generated_resources + ui_strings) | BSD-3-Clause |
| AOSP Settings / frameworks/base | Apache-2.0 |
| DSpace dspace-angular | BSD-3-Clause |
| Flutter Gallery | BSD-3-Clause |
| Godot editor-l10n | MIT |
All upstream licenses are permissive and compatible with a CC-BY-4.0 open layer;
upstream attribution notices accompany the release in THIRD_PARTY_NOTICES.md.
Injected HTML/Markdown assets (added to meet the per-class minimum) are flagged
in provenance.
Harvest key-aligned catalogs; keep only strings that have an English source and a translation in all nine target languages; deduplicate; drop fuzzy, obsolete and stale entries.
Asset extraction with deterministic parsers and regular expressions:
XLIFF placeholders (xliff:g), HTML, printf / positional / named
placeholders, {{template}} variables, inline ICU, Markdown, and
do-not-translate spans (URLs, emails, brand terms verbatim in all
references).
Plural conversion: Android plurals resources and gettext plurals
re-serialized as inline ICU {count, plural, ...} with the original human
translations.
Injection for classes the UI catalogs lack (HTML, Markdown): one tag pair around a verbatim anchor identical across the source and all human translations.
Splits: ~10/70/20, stratified by asset-class profile, partitioned by
item_id.
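The plural-conversion step above can be sketched as a small serializer from a plural table to an inline ICU plural message. The function name and the input shape (a dict keyed by CLDR plural category) are assumptions. The sketch does not escape apostrophes or braces inside the branch text, and it does not show the mapping from gettext's numbered plural forms to CLDR categories, which depends on the language.

```python
# CLDR plural categories in their conventional order.
CATEGORY_ORDER = ["zero", "one", "two", "few", "many", "other"]


def plurals_to_icu(forms: dict, variable: str = "count") -> str:
    """Serialize a plural table as an inline ICU plural message."""
    if "other" not in forms:
        raise ValueError("ICU plural messages require an 'other' branch")
    unknown = set(forms) - set(CATEGORY_ORDER)
    if unknown:
        raise ValueError(f"unknown plural categories: {sorted(unknown)}")
    # Doubled braces in the f-strings emit literal "{" and "}".
    branches = " ".join(
        f"{cat} {{{forms[cat]}}}" for cat in CATEGORY_ORDER if cat in forms
    )
    return f"{{{variable}, plural, {branches}}}"
```

For example, plurals_to_icu({"one": "%d file", "other": "%d files"}) yields {count, plural, one {%d file} other {%d files}}.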
Two checks are run on the dataset itself before release. They validate the benchmark, not any particular MT system.
Discrimination check (K1): can the dataset separate systems that preserve inline assets from systems that do not? Contrasting baseline systems are scored with the real scoring script, and the damaging baselines must come out significantly worse (paired bootstrap, p-value below 0.05 - that is, a gap this large would be unlikely if the systems performed equally).
False-positive check (K2): does the scoring script ever flag a correct human translation as an error? The released references and legal variants of them are rescored; the target is 0%.
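A paired bootstrap of the kind named in the discrimination check (K1) can be sketched as follows. It assumes each system's results are given as 0/1 pass flags over the same records in the same order; the function name, the resample count, the seed, and the one-sided formulation are illustrative choices, not the project's published procedure.

```python
import random


def paired_bootstrap_p(passes_a: list, passes_b: list,
                       n_resamples: int = 2000, seed: int = 0) -> float:
    """Share of resamples in which system A does not beat system B."""
    if len(passes_a) != len(passes_b) or not passes_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    # Pairing: resample per-record differences, not the systems separately.
    diffs = [a - b for a, b in zip(passes_a, passes_b)]
    not_better = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=len(diffs))
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

Here A would be the tag-aware system and B the structure-blind baseline; a returned value below 0.05 corresponds to the separation criterion described above.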
| Check | Result |
|---|---|
| Discrimination (K1) | PASS - in a live run of MT engines, a structure-blind baseline is statistically separated from a tag-aware system on 6 of 7 asset classes (7 of 7 in the offline simulation), paired bootstrap p-value below 0.05; inline ICU is the one class not separated in the live run |
| False positives (K2) | PASS - 0.0% |
| Reference self-check | 5,391/5,391 - every released reference passes the scoring script |
| Croissant 1.0 | croissant.json, mlcroissant-validated |
The build is deterministic and seeded; rebuilding produces a bit-identical
package, and checksums.sha256 (shipped in the archive) verifies a download.
Rebuilding the inputs starts from the reference translations, so the open
layer alone regenerates and verifies the dev split but not the withheld test
and hidden material.
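Verifying a download against checksums.sha256 can be done with the standard library, as in the sketch below. It assumes the manifest uses the common sha256sum layout (a hex digest, whitespace, then a path relative to the archive root); check the shipped file before relying on that assumption.

```python
import hashlib
from pathlib import Path


def verify_checksums(root: Path, manifest: str = "checksums.sha256") -> list:
    """Return the relative paths whose SHA-256 digest does not match."""
    mismatched = []
    for line in (root / manifest).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<hex digest>  <relative path>"
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with "*"
        digest = hashlib.sha256((root / name).read_bytes()).hexdigest()
        if digest != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list means every listed file matches; a file named in the manifest but absent from disk raises FileNotFoundError.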
The scoring script and the full build pipeline are open-source at https://github.com/Prompsit/prompsit-mdc - the dataset content itself is distributed here on MDC.
Asset inventories and positions are machine-extracted and checked by the scoring script rather than manually double-annotated; all released references pass the D1 checks (self-check 5,391/5,391).
ref_tag_positions are located by verbatim search (NFKC plus regex
word-break); assets whose surface differs in the human translation are omitted
rather than guessed.
Injected pairs wrap verbatim anchors only.
Converted ICU items are re-serializations of plural tables (provenance-flagged).
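The verbatim search behind ref_tag_positions can be approximated as follows. This sketch applies NFKC normalization and a plain substring search only; the regex word-break rule mentioned above is left out, and the function name is hypothetical. Returned offsets refer to the normalized translation, which can differ in length from the raw string.

```python
import unicodedata
from typing import Optional, Tuple


def locate_asset(asset: str, translation: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the asset in the NFKC-normalized translation."""
    needle = unicodedata.normalize("NFKC", asset)
    haystack = unicodedata.normalize("NFKC", translation)
    if not needle:
        return None
    start = haystack.find(needle)
    if start == -1:
        return None  # surface differs: omit the position rather than guess
    return start, start + len(needle)
```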