Task: MT
Release Date: 6/30/2026
Format: JSONL
Size: 165.23 KB
An evaluation benchmark that tests whether machine translation keeps a document's structure intact - its heading, paragraph, list, table, link, and image tree - while translating only the text. The documents are composed from human translation segments shared with the D1 inline-asset dataset on a templated structure: 160 HTML documents x 9 target languages (ca/es/fr/it/pt-PT/de/nl/pl/ru) = 1,440 records. Scoring is fully automatic: for each record, deterministic checks compare the document tree of the system output against the source and report which of five error categories occur (lost or duplicated node, block-order change, table-cell corruption, broken link or image, round-trip failure); a record passes when none does, and a system's score is its pass rate, reported overall and per category. Reference labels are generated automatically from the source, not human-annotated. No training set is shipped; the dev split is a small labelled set for optional few-shot prompting or sanity checks. The open layer contains the dev split (with references), the test inputs (references withheld), and a contrastive pack of (correct, damaged) minimal pairs; the hidden split is not distributed. The scoring code is open-source at https://github.com/Prompsit/prompsit-mdc - see DATASHEET.md inside the archive for construction method, checks, and limits.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
Test-split gold withheld; hidden split not distributed.
Forbidden Usage
No additional legal restriction beyond CC-BY-4.0. For benchmark integrity, disclose any training or tuning on the open test inputs and do not report such runs as uncontaminated official benchmark results.
Ethical Review
Permissively licensed text and templated structure only; no personal data.
Intended Use
Evaluation of machine-translation systems for document-structure preservation across nine languages. No training set is shipped.
Version 1.0 | Schema 0.1 | Scoring rule: the document tree must be preserved - same nodes, same block order, same table shape, same link and image targets; only the text is translated
An HTML document carries structure beyond its text: headings, paragraphs, lists, tables, links, and images arranged in a tree. This dataset tests whether machine translation preserves that tree while translating only the text. Each document is composed from human translation segments shared with the D1 inline-asset dataset, arranged on a templated structure, and exists in English plus nine target languages. Inline tags inside text are covered by the D1 dataset and locale data by the D2 dataset; this dataset scores the document tree, with no double-counting.
Reference labels are generated automatically and deterministically from the source by HTML tree and round-trip parsers; they are not human annotations. Every released reference passes the scoring script (1,440/1,440).
Download the open layer from this page and unpack it: data/dev.jsonl
(inputs plus reference translations), data/test.input.jsonl (inputs
only), data/contrastive.dev.jsonl.
Translate the source strings with the MT system you want to evaluate - any system works; no special integration is required.
Score the outputs with the open-source scoring script - score_item(...)
in https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d4-document-structure-integrity/build
(documents.py). Each record passes or fails per error category; your score is
the pass rate.
To verify your harness first, score the dev references themselves: they must pass 100%. For an official, comparable score on the withheld test and hidden splits, contact info@prompsit.com.
The task: translate each HTML document from English into the target language while keeping the document tree intact.
Scoring is fully automatic; no human judges are involved. For every record, a deterministic scoring script (open-source: https://github.com/Prompsit/prompsit-mdc/tree/main/datasets/d4-document-structure-integrity/build) inspects the system output and reports which error categories occur (see the Error categories table below). A record passes if no error category occurs. The score of a system is its pass rate: the fraction of records that pass, reported overall and per error category. The checks compare the document tree of the output against the source document, so no reference translation is needed to score an output.
The scoring entry point is score_item(...) in documents.py in that directory; the README and AGENTS.md at the repository root
walk through scoring your own outputs.
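The pass-rate arithmetic described above can be sketched as follows. This is a minimal illustration, not the repository's code: `pass_rates` and its input shape (one set of detected error categories per record) are hypothetical, and the per-category rate is assumed to be the fraction of records in which that category does not occur.

```python
from collections import Counter

CATEGORIES = (
    "lost_or_duplicated_node",
    "block_order_change",
    "table_cell_corruption",
    "broken_link_image",
    "roundtrip_failure",
)


def pass_rates(results):
    """results: one set per record, holding the error categories found in it."""
    total = len(results)
    if total == 0:
        raise ValueError("no records to score")
    # Count, per category, how many records exhibit it.
    failures = Counter(cat for found in results for cat in found)
    # A record passes only when no category occurs.
    overall = sum(1 for found in results if not found) / total
    per_category = {cat: 1 - failures[cat] / total for cat in CATEGORIES}
    return overall, per_category


# Hypothetical results for four records.
results = [
    set(),
    {"roundtrip_failure"},
    set(),
    {"block_order_change", "broken_link_image"},
]
overall, per_category = pass_rates(results)
print(overall)                            # 0.5
print(per_category["roundtrip_failure"])  # 0.75
```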
The six automated checks are: roundtrip_valid (strict XML well-formedness via
ElementTree, which catches broken attributes and unclosed tags a lenient HTML
parser would swallow), tree_match (the node multiset matches the source),
block_order (the block sequence, table internals excluded), table_shape
(per-table rows x columns), links_images (href and src verbatim), and
segment_count (text-segment count). All are compared against the source
structure, which is language-independent.
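To make the idea concrete, here is a minimal sketch of four of these checks built on the standard-library ElementTree parser. It is an assumption-laden illustration, not the code in documents.py: the helper names `fingerprint` and `check` are hypothetical, and block_order and segment_count are left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CHECKS = ("roundtrip_valid", "tree_match", "table_shape", "links_images")


def fingerprint(html):
    """Parse strictly as XML and summarise the tree.

    Raises ET.ParseError on unclosed tags or broken attributes.
    """
    root = ET.fromstring(html)
    nodes = Counter(el.tag for el in root.iter())
    # One tuple per table: the number of cells in each row.
    tables = [
        tuple(len(list(tr)) for tr in table.iter("tr"))
        for table in root.iter("table")
    ]
    # href of every link and src of every image, in document order.
    targets = [
        el.get("href") if el.tag == "a" else el.get("src")
        for el in root.iter()
        if el.tag in ("a", "img")
    ]
    return nodes, tables, targets


def check(source, output):
    """Compare the output's structure against the source's."""
    src = fingerprint(source)
    try:
        out = fingerprint(output)
    except ET.ParseError:
        # Nothing else can be compared if the output does not parse.
        return dict.fromkeys(CHECKS, False)
    return {
        "roundtrip_valid": True,
        "tree_match": src[0] == out[0],
        "table_shape": src[1] == out[1],
        "links_images": src[2] == out[2],
    }


source = (
    '<html><body><h1>Settings</h1>'
    '<p><a href="https://example.org/help">Help</a></p>'
    '<table><tr><td>Name</td><td>Size</td></tr></table></body></html>'
)
good = source.replace("Settings", "Configuració").replace("Help", "Ajuda")
bad_link = good.replace("example.org/help", "example.org/ajuda")
broken = "<html><body><h1>Configuració</body></html>"

print(check(source, good))      # every check True
print(check(source, bad_link))  # links_images False, the rest True
print(check(source, broken))    # every check False
```

Because only tag names, table shapes, and attribute values are compared, the translated text itself never enters the comparison, which is why the same source fingerprint serves all nine target languages.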
Domain: machine-translation quality evaluation for document-structure preservation.
Size: 1,440 records; the open archive contains the dev split (with references), the test inputs, the contrastive pairs, README, this datasheet, third-party notices, manifest, and Croissant metadata.
Structure: JSONL records with source and target HTML, a structural fingerprint of the document tree (node types and counts), link and image targets, expected invariants, error-category tags, split, and provenance.
License: CC-BY-4.0 open layer; upstream attributions in
THIRD_PARTY_NOTICES.md.
Intended use: evaluation of MT systems for document-structure preservation across nine target languages. No training set is shipped.
Ethical review: permissively licensed text and templated structure only; no personal data.
Contact: Prompsit Language Engineering, info@prompsit.com.
Dataset on MDC (download the open layer): https://mozilladatacollective.com/datasets/cmr0moi2k01c0mk07eocv137z
The test references and the hidden split are withheld, so a score on those splits reflects performance on unseen inputs rather than answers a system could have memorised. For an independent evaluation of an MT or LLM system on the withheld splits, contact Prompsit at info@prompsit.com.
| Split | File | Records | What it contains |
|---|---|---|---|
| Open dev | data/dev.jsonl | 144 | inputs plus the reference translation and labels |
| Test inputs | data/test.input.jsonl | 1,008 | inputs only; references withheld |
| Test references | data/test.ref.jsonl | 1,008 | withheld, retained by Prompsit |
| Hidden | - | 288 | never distributed |
| Contrastive | data/contrastive.dev.jsonl | 720 | verification records for (correct, damaged) minimal pairs; the damaged variants are regenerated bit-identically from the dev split by the open build pipeline, and each pair is separated by the scoring script |
160 sources x 9 languages = 1,440 records. Split ~10% dev / 70% test / 20%
hidden (sources: 16 / 112 / 32), partitioned by item_id, so a source and its
nine translations never cross splits.
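The split arithmetic can be checked with a short sketch. The item_id values, the seed, and the shuffle are assumptions for illustration only; the actual assignment is made by the build pipeline and is not described here.

```python
import random

LANGS = ["ca", "es", "fr", "it", "pt-PT", "de", "nl", "pl", "ru"]


def split_sources(item_ids, seed=0):
    """Assign whole sources to splits so translations never cross them."""
    ids = sorted(item_ids)
    random.Random(seed).shuffle(ids)  # hypothetical seeded shuffle
    return {"dev": ids[:16], "test": ids[16:128], "hidden": ids[128:]}


# Hypothetical identifiers in the d4-NNNNNN style of the sample records.
item_ids = [f"d4-{i:06d}" for i in range(160)]
splits = split_sources(item_ids)

# Each source expands into one record per target language.
records = {
    name: [(item_id, lang) for item_id in ids for lang in LANGS]
    for name, ids in splits.items()
}
counts = {name: len(recs) for name, recs in records.items()}
print(counts)  # {'dev': 144, 'test': 1008, 'hidden': 288}
```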
No training set is shipped. The dev split is a small labelled set for optional few-shot prompting or sanity checks; it is not required to run the benchmark. The test references and the entire hidden split are withheld by Prompsit so that systems cannot be tuned to them.
Language pairs: English (en) into ca, es, fr, it, pt-PT, de, nl, pl, ru. Every source document is
present in all nine languages with the same document tree, so per-language
scores are directly comparable.
Every record is tagged with the error categories it can expose; the scoring script detects each category with a dedicated automated check. Severity is reported alongside a failure for error analysis; it does not change the pass/fail rule.
| Error category | What it means | Severity | Records |
|---|---|---|---|
| lost_or_duplicated_node | a node from the source tree is missing or duplicated | Major | 1,152 |
| block_order_change | block elements appear in a different order | Major | 1,152 |
| table_cell_corruption | a table's rows-by-columns shape changed | Major | 1,152 |
| broken_link_image | an href or src no longer matches the source verbatim | Major | 1,152 |
| roundtrip_failure | the output is no longer well-formed XML | Critical | 1,152 |
Every error category covers at least 400 records, our minimum for a reliable per-category estimate: each category is exercised in all 1,152 dev+test records, and the 288 hidden records are additional.
Real records from the open dev split, truncated for width. Angle brackets in markup are shown as ⟨ ⟩ because this platform strips raw HTML-like tags; the data files contain the ordinary characters.
| item_id | target | source text | target text | structure profile | error categories |
|---|---|---|---|---|---|
| d4-000023 | ca | ⟨html⟩⟨body⟩ ⟨h1⟩{COUNT, plural, =1 {an address} other {# addresses}}⟨/h1⟩ ⟨p⟩{MINUTES, plural, =1 {1m} other {#m}}⟨/p⟩... | ⟨html⟩⟨body⟩ ⟨h1⟩{COUNT,plural, =1{1 adreça}other{# adreces}}⟨/h1⟩ ⟨p⟩{MINUTES,plural, =1{1 m}other{# m}}⟨/p⟩... | h1 p h2 p ul(3) p+a table(2x2) img | lost_or_duplicated_node, roundtrip_failure |
| d4-000025 | ca | ⟨html⟩⟨body⟩ ⟨h1⟩⟨xliff:g id="numprocess"⟩%1$d⟨/xliff:g⟩ process and ⟨xliff:g id="numservices"⟩%2$d⟨/xliff:g⟩ service⟨/h1⟩ ⟨h2⟩{COUNT, plural, =1 {Item} other {# items}}⟨/h2⟩... | ⟨html⟩⟨body⟩ ⟨h1⟩⟨xliff:g id="NUMPROCESS"⟩%1$d⟨/xliff:g⟩ procés i ⟨xliff:g id="NUMSERVICES"⟩%2$d⟨/xliff:g⟩ servei⟨/h1⟩ ⟨h2⟩{COUNT,plural, =1{Element}other{# elements}}⟨/h2⟩... | h1 h2 p p ul(3) table(2x2) p+a img | block_order_change, table_cell_corruption |
| d4-000066 | ca | ⟨html⟩⟨body⟩ ⟨h1⟩MIDI Input on Channel=%s Message=%s⟨/h1⟩ ⟨p⟩at ⟨xliff:g id="time" example="2:33 am"⟩%s⟨/xliff:g⟩⟨/p⟩ ⟨h2⟩⟨xliff:g id="extension" example="PDF"⟩%1$s⟨/xliff:g⟩ file⟨/h2⟩... | ⟨html⟩⟨body⟩ ⟨h1⟩Entrada de MIDI al Canal=%s Missatge=%s⟨/h1⟩ ⟨p⟩a les ⟨xliff:g id="TIME"⟩%s⟨/xliff:g⟩⟨/p⟩ ⟨h2⟩Fitxer ⟨xliff:g id="EXTENSION"⟩%1$s⟨/xliff:g⟩⟨/h2⟩... | h1 p h2 p table(2x2) ul(3) p+a img | broken_link_image, block_order_change |
The structure profile column lists the block sequence of the document: the
three documents above share the same building blocks (headings, paragraphs, a
three-item list, a 2x2 table, a link, an image) but arrange them in different
orders, which is exactly what the block_order check must track.
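A block sequence in the notation of the structure profile column could be derived roughly as below. This is a sketch under assumptions: `structure_profile` is a hypothetical helper, the notation rules are inferred from the three sample rows, and the real fingerprinting code may differ.

```python
import xml.etree.ElementTree as ET


def structure_profile(html):
    """List the top-level blocks of <body>, ignoring table internals."""
    body = ET.fromstring(html).find("body")
    parts = []
    for el in body:
        if el.tag in ("ul", "ol"):
            parts.append(f"{el.tag}({len(el.findall('li'))})")
        elif el.tag == "table":
            rows = list(el.iter("tr"))
            cols = len(list(rows[0])) if rows else 0
            parts.append(f"table({len(rows)}x{cols})")
        elif el.tag == "p" and el.find("a") is not None:
            parts.append("p+a")
        else:
            parts.append(el.tag)
    return " ".join(parts)


def wrap(parts):
    return "<html><body>" + "".join(parts) + "</body></html>"


blocks = [
    "<h1>T</h1>",
    "<p>x</p>",
    "<ul><li>a</li><li>b</li><li>c</li></ul>",
    '<p><a href="https://example.org/">l</a></p>',
    "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>",
    '<img src="logo.png"/>',
]
original = structure_profile(wrap(blocks))
# Same blocks, with the table moved ahead of the list.
moved = [blocks[0], blocks[1], blocks[4], blocks[2], blocks[3], blocks[5]]
reordered = structure_profile(wrap(moved))

print(original)   # h1 p ul(3) p+a table(2x2) img
print(reordered)  # h1 p table(2x2) ul(3) p+a img
```

The two profiles contain the same blocks, so a node count alone would not tell them apart; only the sequence comparison does.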
| Content | License | Structure |
|---|---|---|
| Human translations (shared with the D1 inline-asset dataset) | per-segment (Apache-2.0 / BSD-3-Clause / MIT, inherited) | templated HTML (headings, paragraphs, list, table, link, image) |
The text is human parallel translation; the structural scaffolding is
templated, and the structure is exactly what this dataset scores. Upstream
attribution notices accompany the release in THIRD_PARTY_NOTICES.md.
Segment pool: human translation segments shared with the D1 inline-asset dataset; each segment exists in English plus all nine target languages.
Composition: build_dataset.py assembles each document from those
segments on a fixed template - headings, paragraphs, a three-item list, a
2x2 table, a link, and an image - with the block order varied between
documents.
Reference labels: a structural fingerprint of the document tree (node types and counts), the block sequence, per-table shapes, and the link and image targets are extracted deterministically from the source; the checks compare the output against this fingerprint.
Splits: ~10/70/20, partitioned by item_id.
Two checks are run on the dataset itself before release. They validate the benchmark, not any particular MT system.
Discrimination check (K1): can the dataset separate systems that preserve the document tree from systems that do not? Contrasting baseline systems are scored with the real scoring script, and the damaging baselines must come out significantly worse (paired bootstrap, p-value below 0.05 - that is, the gap is too large to be chance).
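A paired bootstrap over per-record pass/fail outcomes can be sketched as follows. The exact variant, resample count, and seed used for K1 are not stated here, so those choices, and the function name `paired_bootstrap_p`, are assumptions.

```python
import random


def paired_bootstrap_p(passes_a, passes_b, n_resamples=2000, seed=0):
    """One-sided p-value for 'system A is not better than system B'.

    passes_a, passes_b: 0/1 outcomes on the same records, in the same order.
    """
    if len(passes_a) != len(passes_b):
        raise ValueError("both systems must be scored on the same records")
    rng = random.Random(seed)
    n = len(passes_a)
    # Pairing: resample records, keeping each record's two outcomes together.
    diffs = [a - b for a, b in zip(passes_a, passes_b)]
    not_better = 0
    for _ in range(n_resamples):
        resampled_gap = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_gap <= 0:
            not_better += 1
    return not_better / n_resamples


# Hypothetical outcomes: a structure-aware system vs. a structure flattener.
aware = [1] * 50
flattener = [0] * 50
p_value = paired_bootstrap_p(aware, flattener)
p_same = paired_bootstrap_p(aware, aware)
print(p_value)  # 0.0
print(p_same)   # 1.0
```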
False-positive check (K2): does the scoring script ever flag a correct human translation as an error? The released references and legal variants of them are rescored; the target is 0%.
| Check | Result |
|---|---|
| Discrimination (K1) | PASS - structure-blind baselines (structure flattener, raw copy) score 0% where a structure-aware system scores 100%, separated on 5 of 5 error categories (paired bootstrap p-value below 0.05). The baselines are simulated corruption operators scored with the real scoring script. |
| False positives (K2) | PASS - 0.0% flips over 7,200 legal variants (whitespace, reindent, collapse, doctype, upper-case tags) |
| Reference self-check | 1,440/1,440 - every released reference passes the scoring script |
| Croissant 1.0 | croissant.json, mlcroissant-validated |
The build is deterministic and seeded; rebuilding produces a bit-identical
package, and checksums.sha256 (shipped in the archive) verifies a download.
Documents are composed from the D1 translation segments, most of which are
withheld, so the open layer alone does not regenerate the test and hidden
splits.
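Verifying a download against checksums.sha256 might look like the sketch below. It assumes the file follows the common `sha256sum` layout (hex digest, whitespace, relative path per line), which is not confirmed here; `verify` and the demo file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file):
    """Return the listed files whose digest differs from the recorded one."""
    checksum_file = Path(checksum_file)
    root = checksum_file.parent
    mismatches = []
    for line in checksum_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        if sha256_of(root / name) != expected.lower():
            mismatches.append(name)
    return mismatches


# Self-contained demo on temporary files (assumed sha256sum-style layout).
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "dev.jsonl"
    data.write_bytes(b'{"item_id": "demo"}\n')
    sums = Path(tmp) / "checksums.sha256"
    sums.write_text(f"{sha256_of(data)}  dev.jsonl\n", encoding="utf-8")
    clean = verify(sums)
    data.write_bytes(b"tampered\n")
    tampered = verify(sums)

print(clean)     # []
print(tampered)  # ['dev.jsonl']
```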
The scoring script and the full build pipeline are open-source at https://github.com/Prompsit/prompsit-mdc - the dataset content itself is distributed here on MDC.
Reference labels are machine-generated by deterministic parsers rather than human task-annotation; all released references pass the checks (self-check 1,440/1,440).
HTML documents only.
Templated structure with human-translated text; structural variety is limited to one template family with varied block order.
Scores structural shape (tree, block order, table dimensions, link and image targets, segment count) plus strict round-trip validity. Content-to-node alignment (which translated segment landed in which cell or block) is not verified - a swap that preserves shape passes.
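The last limitation can be illustrated with a small sketch: two table rows whose cell texts are swapped have the same shape, so shape-based checks cannot tell them apart. The `shape` helper and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def shape(html):
    """Tag sequence of the tree in document order; text is ignored."""
    return [el.tag for el in ET.fromstring(html).iter()]


correct = "<table><tr><td>Nom</td><td>Preu</td></tr></table>"
swapped = "<table><tr><td>Preu</td><td>Nom</td></tr></table>"

same_shape = shape(correct) == shape(swapped)
same_text = (
    list(ET.fromstring(correct).itertext())
    == list(ET.fromstring(swapped).itertext())
)
print(same_shape)  # True
print(same_text)   # False
```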