Task: MT
Release Date: 6/30/2026
Format: JSONL
Size: 128.28 KB
Share
An evaluation benchmark that tests whether machine translation preserves the skeleton of structured resource files - keys, schema, and non-translatable fields - while translating only the values, with output that still parses. English resource fragments in four formats (Android XML, JSON, Java .properties, Flutter ARB) are paired with human translations from AOSP Settings string resources (Apache-2.0): 420 sources x 9 target languages (ca/es/fr/it/pt-PT/de/nl/pl/ru) = 3,780 records. Scoring is fully automatic: for each record, deterministic checks parse the system output and report which of six error categories occur - parser_break, key_path_translated, schema_changed, value_untranslated, nonvalue_modified_literal (a symbol, number, placeholder or the ARB @@locale field altered; checked in every format) and nonvalue_modified_marked (a token marked translatable="false" altered; XML only); a record passes when none does, and a system's score is its pass rate, reported overall and per category. Reference labels are generated automatically from the source, not human-annotated. No training set is shipped; the dev split is a small labelled set for optional few-shot prompting or sanity checks. The open layer contains the dev split (with references), the test inputs (references withheld), and a contrastive pack of (correct, damaged) minimal pairs; the hidden split is not distributed. The scoring code is open-source at https://github.com/Prompsit/prompsit-mdc - see DATASHEET.md inside the archive for construction method, checks, and limits.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.htmlRestrictions/Special Constraints
Test-split gold withheld; hidden split not distributed.
Forbidden Usage
No additional legal restriction beyond CC-BY-4.0. For benchmark integrity, disclose any training or tuning on the open test inputs and do not report such runs as uncontaminated official benchmark results.
Ethical Review
Permissively-licensed (Apache-2.0 AOSP) resources only; no personal data.
Intended Use
Evaluation of machine-translation systems for value-only translation of structured resource files (XML, JSON, .properties, ARB) across nine languages. No training set is shipped.
Version 1.0 | Schema 0.2 | Scoring rule: keys, schema and non-translatable fields must be preserved; only values are translated; the file must still parse
Localization ships as structured resource files. This dataset tests whether machine translation preserves the skeleton of such a file - its keys, its schema, and its non-translatable fields - while translating only the values, and returns output that still parses. Each record is a small resource file in one of four formats (Android XML, JSON, Java .properties, Flutter ARB) paired with a human translation into one of nine languages; the values are human translations from AOSP Settings string resources.
Reference labels are generated automatically and deterministically from the source by format parsers; they are not human annotations. Every released reference passes the scoring script (3,780/3,780). Values that carry inline tags are the D1 dataset's concern; D3 covers keys, schema and plain values, with no double-counting.
Download the open layer from this page and unpack it: data/dev.jsonl
(inputs plus reference translations), data/test.input.jsonl (inputs
only), data/contrastive.dev.jsonl.
Translate the source strings with the MT system you want to evaluate - any system works; no special integration is required.
Score the outputs with the open-source scoring script - score_item(...)
in https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d3-structured-resource-integrity/build
(resources.py). Each record passes or fails per error category; your score is
the pass rate.
To verify your harness first, score the dev references themselves: they must pass 100%. For an official, comparable score on the withheld test and hidden splits, contact info@prompsit.com.
The task: translate each resource file from English into the target language, changing only the values while keeping keys, schema and non-translatable fields unchanged, so that the file still parses.
Scoring is fully automatic; no human judges are involved. For every record, a deterministic scoring script (open-source: https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d3-structured-resource-integrity/build) inspects the system output and reports which error categories occur (see the Error categories table below). A record passes if no error category occurs. The score of a system is its pass rate: the fraction of records that pass, reported overall and per error category. The checks parse the output resource file and compare its keys, schema and non-translatable fields against the source, so no reference translation of the output is needed.
The scoring entry point is score_item(...) in resources.py there; the README and AGENTS.md at the repository root
walk through scoring your own outputs.
The automated checks are: parser_valid, key_path_match, schema_match,
value_translated (reference-aware: a value that legitimately stays identical
in the human reference, such as a proper noun, is not counted as
untranslated), and nonvalue_preserved in two tiers (see the Error categories
section).
Domain: software localization and machine-translation quality evaluation for structured resource files (XML, JSON, .properties, ARB).
Size: 3,780 records; the open archive contains the dev split (with references), the test inputs, the contrastive pairs, README, this datasheet, third-party notices, manifest, and Croissant metadata.
Structure: JSONL records with the source resource file and its reference translation, format, key/value metadata, non-translatable fields, expected invariants, error-category tags, split, and provenance.
License: CC-BY-4.0 open layer; upstream attributions in
THIRD_PARTY_NOTICES.md.
Intended use: evaluation of MT systems for value-only translation of structured resource files across nine target languages. No training set is shipped.
Ethical review: permissively licensed software resource strings only; no personal data.
Contact: Prompsit Language Engineering, info@prompsit.com.
Dataset on MDC (download the open layer): https://mozilladatacollective.com/datasets/cmr0mny2b01asns073985z0va
The test references and the hidden split are withheld, so a score on those splits reflects performance on unseen inputs rather than answers a system could have memorised. For an independent evaluation of an MT or LLM system on the withheld splits, contact Prompsit at info@prompsit.com.
| Split | File | Records | What it contains |
|---|---|---|---|
| Open dev | data/dev.jsonl | 378 | inputs plus the reference translation and labels |
| Test inputs | data/test.input.jsonl | 2,646 | inputs only; references withheld |
| Test references | data/test.ref.jsonl | 2,646 | withheld, retained by Prompsit |
| Hidden | - | 756 | never distributed |
| Contrastive | data/contrastive.dev.jsonl | 1,890 | verification records for (correct, damaged) minimal pairs; the damaged variants are regenerated bit-identically from the dev split by the open build pipeline, and each pair is separated by the scoring script |
420 sources x 9 languages = 3,780 records. Split ~10% dev / 70% test / 20%
hidden (sources: 42 / 294 / 84), stratified by format and error-category
profile and partitioned by item_id, so a source and its nine translations
never cross splits.
No training set is shipped. The dev split is a small labelled set for optional few-shot prompting or sanity checks; it is not required to run the benchmark. The test references and the entire hidden split are withheld by Prompsit so that systems cannot be tuned to them.
en into ca, es, fr, it, pt-PT, de, nl, pl, ru. Every source is present in
all nine languages in the same format, so per-language scores are directly
comparable.
Every record is tagged with the error categories it can expose; the scoring script detects each category with a dedicated automated check. Severity is reported alongside a failure for error analysis; it does not change the pass/fail rule.
| Error category | What it means | Severity | Records |
|---|---|---|---|
parser_break | the output no longer parses | Critical | 3,024 |
key_path_translated | a key or path was translated | Major | 3,024 |
schema_changed | the structure or nesting changed | Major | 3,024 |
value_untranslated | a translatable value was left in English | Major | 3,024 |
nonvalue_modified_literal | a non-translatable token - symbols, numbers, ratios, degrees, placeholders, the ARB @@locale field - was altered; checked in every format | Major | 2,592 |
nonvalue_modified_marked | an alphabetic token explicitly marked translatable="false" was altered; XML only | Major | 432 |
At least 400 records per error category (our minimum for a reliable
per-category estimate); counts are over the scored dev and test records. Four
format profiles are covered: xml (native Android), json, properties, and
arb (which carries the non-translatable @@locale metadata field).
Non-translatable fields are protected in two tiers. Tier 1 covers tokens with
no letters - symbols, numbers, ratios, degrees, placeholders, and the ARB
@@locale field. These can never be legitimately translated, so they are
checked in every format, no marker needed. Tier 2 covers alphabetic tokens
(acronyms, brands, proper nouns), which sometimes are legitimately
translatable; these are checked only in XML, where the source explicitly marks
them with translatable="false". An unmarked alphabetic token that passes
through unchanged in JSON, .properties or ARB is not penalised; no inline
do-not-translate list is shipped, which keeps this dataset separate from the
D5 dataset.
Real records from the open dev split, truncated for width. Angle brackets in markup are shown as ⟨ ⟩ because this platform strips raw HTML-like tags; the data files contain the ordinary characters.
| item_id | target | source text | target text | format | error categories |
|---|---|---|---|---|---|
| d3-000001 | ca | ⟨?xml version="1.0" encoding="utf-8"?⟩ ⟨resources⟩ ⟨string name="accessibility_action_label_panel_slice"⟩enter settings⟨/string⟩ ... ⟨string name="external_display_rotation_180" translatable="false"⟩180°⟨/string⟩ ⟨/resources⟩ | ⟨?xml version="1.0" encoding="utf-8"?⟩ ⟨resources⟩ ⟨string name="accessibility_action_label_panel_slice"⟩obre la configuració⟨/string⟩ ... ⟨string name="external_display_rotation_180" translatable="false"⟩180°⟨/string⟩ ⟨/resources⟩ | xml | parser_break, key_path_translated, schema_changed, value_untranslated, nonvalue_modified_literal |
| d3-000124 | ca | { "apn_user": "Username", "app_and_notification_dashboard_summary": "Recent apps, default apps", ... "external_display_rotation_270": "270°" } | { "apn_user": "Nom d'usuari", "app_and_notification_dashboard_summary": "Aplicacions recents, aplicacions predeterminades", ... "external_display_rotation_270": "270°" } | json | parser_break, key_path_translated, schema_changed, value_untranslated, nonvalue_modified_literal |
| d3-000223 | ca | battery_app_usage=App usage since last full charge battery_app_usage_for=App usage for %s ... | battery_app_usage=Ús de l'aplicació des de la darrera càrrega completa battery_app_usage_for=Ús d'aplicacions entre %s ... | properties | parser_break, key_path_translated, schema_changed, value_untranslated, nonvalue_modified_literal |
| d3-000322 | ca | { "@@locale": "en", "bounce_keys_dialog_title": "Bounce key threshold", ... "print_job_summary": "%1$s %2$s" } | { "@@locale": "ca", "bounce_keys_dialog_title": "Llindar de la tecla de rebot", ... "print_job_summary": "%1$s %2$s" } | arb | parser_break, key_path_translated, schema_changed, value_untranslated, nonvalue_modified_literal |
Every record exercises the first five categories; XML records whose
non-translatable token is alphabetic (marked translatable="false" in the
source) also exercise nonvalue_modified_marked.
| Corpus | License | What |
|---|---|---|
AOSP Settings string resources (aosp-mirror/platform_packages_apps_Settings@7c598253ff60) | Apache-2.0 | EN plus 9-language strings.xml, identical keys, human translations |
The pinned upstream files can be re-downloaded with the open build pipeline. Values are human translations from AOSP; the JSON, .properties and
ARB profiles re-serialize the same key-value pairs. The upstream license is
permissive and compatible with a CC-BY-4.0 open layer; the attribution notice
accompanies the release in THIRD_PARTY_NOTICES.md.
Harvest: download the pinned AOSP Settings string resources - English plus the nine target languages with identical keys; keep only keys present in all ten files.
Classification: mark each key as translatable (the English value
differs from the translations) or non-translatable (identical in every
language); assign non-translatable tokens to Tier 1 (letterless) or Tier 2
(alphabetic, marked translatable="false" in XML).
Fragmenting and serialization: group keys into small resource fragments
(three translatable keys plus one non-translatable) and serialize each
fragment into one of the four format profiles - Android XML, JSON, Java
.properties, or Flutter ARB (which adds the @@locale metadata field).
Pairing: the source is the English serialization; the reference is the target-language serialization of the same fragment with the human values; each record is tagged with the error categories it can expose.
Splits: ~10/70/20, stratified by format and error-category profile,
partitioned by item_id.
Two checks are run on the dataset itself before release. They validate the benchmark, not any particular MT system.
Discrimination check (K1): can the dataset separate systems that preserve the resource skeleton from systems that do not? Contrasting baseline systems are scored with the real scoring script, and the damaging baselines must come out significantly worse (paired bootstrap, p-value below 0.05 - that is, the gap is too large to be chance).
False-positive check (K2): does the scoring script ever flag a correct human translation as an error? The released references and legal variants of them are rescored; the target is 0%.
| Check | Result |
|---|---|
| Discrimination (K1) | PASS - resource-blind baselines (untranslated passthrough, raw copy) score 0% where a resource-aware system scores 100%, separated on 6 of 6 error categories (paired bootstrap p-value below 0.05); single-error baselines are caught on their target category. The baselines are simulated corruption operators scored with the real scoring script. |
| False positives (K2) | PASS - 0.0% flips over 10,260 legal variants (key reorder, reindent, comments, blank lines, trailing newline) |
| Reference self-check | 3,780/3,780 - every released reference passes the scoring script |
| Croissant 1.0 | croissant.json, mlcroissant-validated |
The build is deterministic and seeded; rebuilding produces a bit-identical
package, and checksums.sha256 (shipped in the archive) verifies a download.
D3 is built from a pinned public upstream (the AOSP revision in the Source
data table), so the open build pipeline can regenerate the dataset in full;
official, comparable scores on the withheld splits are still issued only by
Prompsit (see Independent evaluation).
The scoring script and the full build pipeline are open-source at https://github.com/Prompsit/prompsit-mdc - the dataset content itself is distributed here on MDC.
Reference labels are machine-generated by format parsers and checked by the scoring script rather than manually double-annotated; all released references pass the D3 checks (self-check 3,780/3,780).
One source corpus (AOSP Settings); the JSON, .properties and ARB profiles re-serialize the AOSP key-value pairs rather than native files of those formats.
Flat resources only; nested-path schemas such as deep JSON or YAML are outside this package.
The baseline results shipped with this package are synthetic (deterministically generated corruptions used to exercise the scoring script), not outputs of real MT systems.
Values that carry inline tags are covered by the D1 dataset; adherence to supplied glossaries and translation memories by the D5 dataset; D3 covers keys, schema and plain values, with no double-counting.