Task: MT
Release Date: 6/30/2026
Format: JSONL
Size: 163.87 KB
Share
An evaluation benchmark that tests whether machine translation renders locale-sensitive values - numbers, dates, currency amounts, and physical units - in the target-locale convention instead of copying the source form through; the value itself must not change (no exchange rates, no metrication). Each of 680 source items embeds one value in a human-translated sentence shared with the D1 inline-asset dataset, across nine target locales (ca-ES/es-ES/fr-FR/it-IT/pt-PT/de-DE/nl-NL/pl-PL/ru-RU) = 6,120 records; the expected target-locale forms come from Unicode CLDR data via Babel. Scoring is fully automatic: for each record, deterministic checks compare the rendered value in the system output against the target-locale forms and report which of six error categories occur (wrong decimal separator, wrong grouping, mis-converted date/time, broken currency format, wrong unit format, missing entity); a record passes when none does, and a system's score is its pass rate, reported overall and per category. Reference labels are generated automatically from the source, not human-annotated. No training set is shipped; the dev split is a small labelled set for optional few-shot prompting or sanity checks. The open layer contains the dev split (with references), the test inputs (references withheld), and a contrastive pack of (correct, damaged) minimal pairs; the hidden split is not distributed. The scoring code is open-source at https://github.com/Prompsit/prompsit-mdc - see DATASHEET.md inside the archive for construction method, checks, and limits.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.htmlRestrictions/Special Constraints
Test-split gold withheld; hidden split not distributed.
Forbidden Usage
No additional legal restriction beyond CC-BY-4.0. For benchmark integrity, disclose any training or tuning on the open test inputs and do not report such runs as uncontaminated official benchmark results.
Ethical Review
Permissively licensed text and CLDR locale data only; no personal data.
Intended Use
Evaluation of machine-translation systems for locale-data integrity (numbers, dates, units, currency) across nine locales. No training set is shipped.
Version 1.0 | Schema 0.2 | Scoring rule: locale-sensitive values must be rendered in the target-locale convention; the value itself must not change
Translated software and documentation carry locale-sensitive values: numbers,
dates, currency amounts, and physical units. The correct output depends on the
target locale - 1,200,000 in en-US text should read 1.200.000 in ca-ES - so
a translation that copies the source form through is wrong even when every word
is right. This dataset tests whether machine translation renders such values in
the target-locale convention: decimal and grouping separators, date pattern,
currency symbol and position, localized unit symbol. The value itself must not
change - no exchange-rate conversion, no metrication. Each record embeds one
value in a neutral [...] slot of a human-translated sentence shared with the
D1 dataset. A markup-preservation check like D1's would wrongly penalise a
correctly localised number, so locale rendering is scored as its own dataset.
Reference labels are generated automatically and deterministically from the source by rendering each value with Unicode CLDR locale data via Babel; they are not human annotations. This CLDR renderer is the reference oracle - the deterministic source of the expected target-locale forms and their accepted variants. Every released reference passes the scoring script (6,120/6,120).
Download the open layer from this page and unpack it: data/dev.jsonl
(inputs plus reference translations), data/test.input.jsonl (inputs
only), data/contrastive.dev.jsonl.
Translate the source strings with the MT system you want to evaluate - any system works; no special integration is required.
Score the outputs with the open-source scoring script - score_item(...)
in https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d2-locale-data-integrity/build
(validators.py). Each record passes or fails per error category; your score is
the pass rate.
To verify your harness first, score the dev references themselves: they must pass 100%. For an official, comparable score on the withheld test and hidden splits, contact info@prompsit.com.
The task: translate each sentence from English into the target locale while rendering the embedded locale-sensitive value in that locale's convention.
Scoring is fully automatic; no human judges are involved. For every record, a deterministic scoring script (open-source: https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d2-locale-data-integrity/build) inspects the system output and reports which error categories occur (see the Error categories table below). A record passes if no error category occurs. The score of a system is its pass rate: the fraction of records that pass, reported overall and per error category. The checks compare the rendered value in the output against the target-locale forms derived from Unicode CLDR (Babel) for that record, so no reference translation of the output is needed.
The scoring entry point is score_item(...) in validators.py there; the README and AGENTS.md at the repository root
walk through scoring your own outputs.
Every record in v1.0 is a format-track record: the value is rendered as-is
in the target locale, and only the locale form is scored. A conversion track
that does change the value (exchange rates, metrication) is defined in the
record schema as an opt-in extension; v1.0 ships no conversion records, and
conversion_required is false in every record. Legal variants - for example
the currency code instead of the symbol, or a long unit form - are listed per
record in accepted_variants and are not penalised.
Domain: software localization and machine-translation quality evaluation for locale-specific rendering of numbers, dates, currencies, and units.
Size: 6,120 records; the open archive contains the dev split (with references), the test inputs, the contrastive pairs, README, this datasheet, third-party notices, manifest, and Croissant metadata.
Structure: JSONL records with source and target text, source and target locales, entity metadata (kind, semantic value, expected rendering, accepted variants), expected invariants, error-category tags, split, and provenance.
License: CC-BY-4.0 open layer; upstream attributions in
THIRD_PARTY_NOTICES.md.
Intended use: evaluation of MT systems for target-locale rendering of locale-sensitive values across nine locales. No training set is shipped.
Ethical review: permissively licensed text and CLDR locale data only; no personal data.
Contact: Prompsit Language Engineering, info@prompsit.com.
Dataset on MDC (download the open layer): https://mozilladatacollective.com/datasets/cmr0mnoo201bwmk07nh4yc04u
The test references and the hidden split are withheld, so a score on those splits reflects performance on unseen inputs rather than answers a system could have memorised. For an independent evaluation of an MT or LLM system on the withheld splits, contact Prompsit at info@prompsit.com.
| Split | File | Records | What it contains |
|---|---|---|---|
| Open dev | data/dev.jsonl | 630 | inputs plus the reference translation and labels |
| Test inputs | data/test.input.jsonl | 4,266 | inputs only; references withheld |
| Test references | data/test.ref.jsonl | 4,266 | withheld, retained by Prompsit |
| Hidden | - | 1,224 | never distributed |
| Contrastive | data/contrastive.dev.jsonl | 821 | (correct, damaged) minimal pairs from the dev split, each pair separated by the scoring script |
680 sources x 9 locales = 6,120 records. Split ~10% dev / 70% test / 20%
hidden (sources: 70 / 474 / 136), stratified by error category and partitioned
by item_id, so a source and its nine translations never cross splits.
No training set is shipped. The dev split is a small labelled set for optional few-shot prompting or sanity checks; it is not required to run the benchmark. The test references and the entire hidden split are withheld by Prompsit so that systems cannot be tuned to them.
en-US into ca-ES, es-ES, fr-FR, it-IT, pt-PT, de-DE, nl-NL, pl-PL, ru-RU -
the same nine-language matrix as the D1 dataset. The dataset is keyed by full
locale rather than language alone, because a language code underspecifies
separators, date patterns, and currency format. Every source is present in all
nine locales with the same embedded value, so per-locale scores are directly
comparable.
Every record is tagged with the error categories it can expose; the scoring script detects each category with a dedicated automated check. Severity is reported alongside a failure for error analysis; it does not change the pass/fail rule.
| Error category | What it means | Severity | Records |
|---|---|---|---|
wrong_decimal_separator | the decimal separator is not the target-locale one (comma vs point) | Major | 2,223 |
wrong_grouping | digit grouping is missing or uses the wrong separator for the target locale | Minor | 1,449 |
mis_converted_datetime | the date is not rendered in the target-locale pattern (for example left in the source form) | Major | 1,152 |
broken_currency_format | the currency symbol, code, or its position does not follow the target-locale pattern | Major | 1,152 |
wrong_unit_format | the unit symbol is not the localized form the target locale uses | Minor | 425 |
missing_entity | the output drops the value entirely | Critical | all |
At least 400 records per error category (our minimum for a reliable
per-category estimate). Record counts are over the scored set (the dev and test
splits). Four entity kinds are covered: number (160 sources), currency (160),
date (160), and unit (200). missing_entity can occur on any record, since
every record carries exactly one value. wrong_unit_format is scoreable in the
four locales where CLDR prescribes a unit symbol different from the English
form (fr-FR, pl-PL, pt-PT, ru-RU); elsewhere the unchanged symbol is the
correct rendering.
Real records from the open dev split, truncated for width. Angle brackets in markup are shown as ⟨ ⟩ because this platform strips raw HTML-like tags; the data files contain the ordinary characters.
| item_id | target | source text | target text | entity (input -> expected) | error categories |
|---|---|---|---|---|---|
| d2-000023 | ca | ⟨ahelp hid="..."⟩Sorts the selection from the lowest value to the highest value. You can define the sort rules...⟨/ahelp⟩ ... [£19,999.90] | ⟨ahelp hid="..."⟩Ordena la selecció del valor més petit al més gran. Podeu definir les regles d'ordenació...⟨/ahelp⟩ ... [19.999,90 £] | currency: £19,999.90 -> 19.999,90 £ | wrong_decimal_separator, wrong_grouping, broken_currency_format |
| d2-000163 | ca | ⟨item type="input"⟩=OFFSET(A1;2;2)⟨/item⟩ returns the value in cell C3 (A1 moved by two rows and two columns down)... [Jul 1, 2024] | ⟨item type="input"⟩=DESPLAÇAMENT(A1;2;2)⟨/item⟩ retorna el valor de la cel·la C3 (A1 desplaçada dues files i dues columnes cap avall)... [1 de jul. 2024] | date: Jul 1, 2024 -> 1 de jul. 2024 | mis_converted_datetime |
| d2-000321 | ca | ⟨emph⟩Server⟨/emph⟩ is the name of a server application. ⟨item type="productname"⟩%PRODUCTNAME⟨/item⟩ applications have the server name... [1,200,000] | ⟨emph⟩Servidor⟨/emph⟩ és el nom d'una aplicació de servidor. En el cas de les aplicacions de l'⟨item type="productname"⟩%PRODUCTNAME⟨/item⟩... [1.200.000] | number: 1,200,000 -> 1.200.000 | wrong_decimal_separator, wrong_grouping |
| d2-000489 | ca | ⟨ahelp hid=""⟩Enter or edit general information for an ⟨link ...⟩XML filter⟨/link⟩.⟨/ahelp⟩ [350 lb] | ⟨ahelp hid=""⟩Introduïu o editeu la informació general per a un ⟨link ...⟩filtre XML⟨/link⟩.⟨/ahelp⟩ [350 lb] | unit: 350 lb -> 350 lb (ca-ES keeps lb; 350 lliures also accepted) | missing_entity only (the ca-ES unit symbol is unchanged) |
The sentences that host the values are human translations shared with the D1
inline-asset dataset, drawn from key-aligned localization catalogs; each
record's provenance field names its corpus, license, URL, revision, and the
originating D1 item. The expected target-locale forms come from Unicode CLDR
locale data as bundled with Babel 2.18.0.
| Source | License |
|---|---|
| Apache OpenOffice (openoffice-translation) | Apache-2.0 |
| AOSP Settings / frameworks/base | Apache-2.0 |
| Chromium (generated_resources + ui_strings) | BSD-3-Clause |
| DSpace dspace-angular | BSD-3-Clause |
| Unicode CLDR (via Babel 2.18.0) | Unicode-3.0 |
All upstream licenses are permissive and compatible with a CC-BY-4.0 open layer;
upstream attribution notices accompany the release in THIRD_PARTY_NOTICES.md.
Sentence selection: human-translated segments present in EN plus all nine target languages, reused from the D1 harvest with their licenses and provenance inherited per record.
Value generation: 680 items - 160 numbers, 160 currency amounts, 160 dates, 200 unit measures - with values picked deterministically from fixed pools, seeded per item.
Rendering: the source form is produced for en-US and the expected form for each target locale from CLDR data via Babel 2.18.0, together with the accepted legal variants and, for each applicable error category, a damaged form used to build the contrastive pairs.
Injection: the value is appended to the sentence in a neutral [...]
slot, in the same position in the source and in every reference.
Splits: ~10/70/20, stratified by error category, partitioned by
item_id.
Two checks are run on the dataset itself before release. They validate the benchmark, not any particular MT system.
Discrimination check (K1): can the dataset separate systems that render locale-sensitive values in the target-locale convention from systems that do not? Contrasting baseline systems are scored with the real scoring script, and the damaging baselines must come out significantly worse (paired bootstrap, p-value below 0.05).
False-positive check (K2): does the scoring script ever flag a correct human translation as an error? The released references and legal variants of them are rescored; the target is 0%.
| Check | Result |
|---|---|
| Discrimination (K1) | PASS - locale-blind passthrough baselines score 16.5% overall where a locale-aware system scores 100%, separated on all 5 error categories that have a corruption operator (paired bootstrap p-value below 0.05); the sixth, missing_entity, fires only when the output drops the value entirely. The baselines are simulated corruption operators scored with the real scoring script. |
| False positives (K2) | PASS - 0.0% flips over 20,036 legal variants (NBSP/NNBSP/thin-space grouping, date length, currency code, unit long form) |
| Reference self-check | 6,120/6,120 - every released reference passes the scoring script |
| Croissant 1.0 | croissant.json, mlcroissant-validated |
The build is deterministic and seeded; rebuilding produces a bit-identical
package, and checksums.sha256 (shipped in the archive) verifies a download.
The build seed and the locale-data pin (Babel 2.18.0 with
its bundled CLDR) are recorded in manifest.json, shipped in the archive.
Rebuilding starts from the reference translations, so the open layer alone
does not regenerate the withheld splits.
The scoring script and the full build pipeline are open-source at https://github.com/Prompsit/prompsit-mdc - the dataset content itself is distributed here on MDC.
Reference renderings are machine-generated from CLDR (via Babel) and checked by the scoring script rather than manually double-annotated; all released references pass the D2 checks (self-check 6,120/6,120).
One value per source; the host sentences are reused across entity kinds.
Baseline scores in this package come from simulated systems (deterministic corruption operators); no outputs of live MT engines are included.
The package contains es-ES and pt-PT profiles; it does not include pt-BR.