BOUQuET

Links and references

An interface for contributing new volunteer translations: https://bouquet.metademolab.com
A paper by Omnilingual Team, 2025 with the original design of the dataset: https://arxiv.org/abs/2502.04314
A paper by Omnilingual Team, 2026 with its extension to more languages: https://arxiv.org/abs/2603.16309
A leaderboard of machine translation systems based on BOUQuET: https://huggingface.co/spaces/facebook/bouquet
Code for evaluating new models: https://github.com/facebookresearch/bouquet
A mirror version of the dataset, with additional configurations and Met-BOUQuET data (human judgements of machine translation quality): https://huggingface.co/datasets/facebook/bouquet

Uses

The dataset is intended for evaluating machine translation quality. In purpose, it is similar to FLORES+ or WMT24++. Unlike these datasets, BOUQuET focuses more on linguistic diversity, both across languages (including some extremely low-resourced languages) and within a language (covering different registers).

The base BOUQuET dataset is not intended as a training dataset, but the dev subset may be used for validation during model development.

As default evaluation metrics, we recommend ChrF++ and MetricX. For difficult target languages, model-based metrics like MetricX should be adjusted with LID (language identification) scores, e.g. from GlotLID, to penalize off-target translations. Our evaluation codebase contains the recommended implementations of the evaluation metrics.
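
One possible form of such a LID adjustment is sketched below. This is an illustration only, not the implementation from the evaluation codebase: the function name, the confidence threshold, and the choice of replacing the score with the worst possible value are all assumptions. The sketch relies on MetricX reporting error scores, where 0 is best and 25 is worst.

```python
# Assumption: MetricX-style error scores, where 0 is best and 25 is worst.
WORST_METRICX_SCORE = 25.0


def lid_adjusted_score(
    metric_score: float,
    predicted_lang: str,
    lid_confidence: float,
    target_lang: str,
    min_confidence: float = 0.5,  # hypothetical threshold, tune as needed
    worst_score: float = WORST_METRICX_SCORE,
) -> float:
    """Return the metric score, or the worst score for an off-target translation.

    A translation is treated as off-target when the LID model predicts,
    with at least `min_confidence`, a language other than the target.
    """
    off_target = predicted_lang != target_lang and lid_confidence >= min_confidence
    return worst_score if off_target else metric_score


# LID predictions would come from a model such as GlotLID; they are hard-coded here.
print(lid_adjusted_score(3.2, "hin_Deva", 0.98, "hin_Deva"))  # on-target: score kept
print(lid_adjusted_score(3.2, "eng_Latn", 0.97, "hin_Deva"))  # off-target: penalized
```

Without such a penalty, a fluent translation into the wrong language could still receive a favorable model-based score.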

Dataset composition

BOUQuET consists of short paragraphs, fully parallel in all languages at the sentence level. The dataset is distributed both at the sentence level and at the paragraph level. By default, data with both levels is loaded; the paragraph_level and sentence_level configs may be used to load the levels separately.

The public portion of the dataset contains two splits:

dev: 504 unique sentences, 120 paragraphs
test: 854 unique sentences, 198 paragraphs

An additional split made up of 632 unique sentences and 144 paragraphs is being held out for quality assurance purposes and is not distributed here.
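
Because the data is parallel across all languages, rows can be regrouped into arbitrary language pairs. The sketch below assumes rows shaped like the fields listed under "Data columns" (uniq_id, src_lang, tgt_lang, src_text, tgt_text) and assumes that translations of the same item share a uniq_id. The helper name pair_languages and the toy rows are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def pair_languages(
    rows: Iterable[Dict[str, str]], source: str, target: str
) -> List[Tuple[str, str, str]]:
    """Build (uniq_id, source_text, target_text) triples for any language pair.

    Each row pairs a non-English text (src_lang/src_text) with its English
    counterpart (tgt_lang/tgt_text). Rows sharing a uniq_id are assumed to be
    translations of the same item.
    """
    texts: Dict[str, Dict[str, str]] = {}  # uniq_id -> {language code -> text}
    for row in rows:
        by_lang = texts.setdefault(row["uniq_id"], {})
        by_lang[row["src_lang"]] = row["src_text"]
        by_lang[row["tgt_lang"]] = row["tgt_text"]
    return [
        (uid, by_lang[source], by_lang[target])
        for uid, by_lang in sorted(texts.items())
        if source in by_lang and target in by_lang
    ]


# Invented toy rows, not real dataset content.
rows = [
    {"uniq_id": "P1-S1", "src_lang": "fra_Latn", "tgt_lang": "eng_Latn",
     "src_text": "Bonjour.", "tgt_text": "Hello."},
    {"uniq_id": "P1-S1", "src_lang": "deu_Latn", "tgt_lang": "eng_Latn",
     "src_text": "Hallo.", "tgt_text": "Hello."},
]
print(pair_languages(rows, "fra_Latn", "deu_Latn"))
```

Items that lack either of the two requested languages are skipped rather than raising an error.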

Mozilla contributions

A part of the BOUQuET dataset has been contributed by the Mozilla Foundation. This part consists of translations into the following 16 language varieties: azz_Latn, bas_Latn, bft_Arab, brh_Arab, bsh_Arab, cak_Latn, chv_Cyrl, cux_Latn, dua_Latn, eto_Latn, kls_Arab, ksf_Latn, kxp_Arab, skr_Arab, tui_Latn, ydg_Arab.

We are extremely grateful to our Mozilla Foundation partners for this collaboration!

Data columns

The BOUQuET dataset contains the following fields:

level (str): "sentence_level" (paragraph-level texts can be reconstructed by joining the sentences of each paragraph with a whitespace or a newline, as per the newline_next column)
split (str): "dev" or "test"
uniq_id (str): identifier of the dataset item (e.g. P464-S1 for sentence-level, P464 for paragraph-level data)
src_lang (str): NLLB-compatible non-English language code (such as hin_Deva)
tgt_lang (str): always eng_Latn (but because the data is multiway parallel, any two languages can be paired as source and target instead)
src_text (str): non-English text
tgt_text (str): English text
orig_text (str): the original text (sentence or paragraph), which sometimes corresponds to src_text
par_comment (str): comment on the whole paragraph
newline_next (bool): whether the sentence should be followed by a newline in the paragraph
par_id (str): paragraph id (e.g. P464)
domain (str): one of the 8 domains (see their list below)
register (str): three-letter identifier of the register of the original sentence (see the paper for the explanation of their meaning)
tags (str): comma-separated linguistic tags of a sentence (see the paper for more details)
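
The reconstruction of paragraph-level text from sentence-level rows, as described for the level and newline_next fields, can be sketched as follows. The sketch assumes that the rows belong to a single language and are already in sentence order; the helper name rebuild_paragraphs and the toy rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rebuild_paragraphs(rows: Iterable[dict], text_field: str = "src_text") -> Dict[str, str]:
    """Join the sentences of each paragraph, keyed by par_id.

    A sentence is followed by a newline when newline_next is true and by a
    single space otherwise. Nothing is appended after the last sentence.
    """
    by_paragraph: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_paragraph[row["par_id"]].append(row)

    paragraphs: Dict[str, str] = {}
    for par_id, sentences in by_paragraph.items():
        parts: List[str] = []
        for i, sent in enumerate(sentences):
            parts.append(sent[text_field])
            if i < len(sentences) - 1:
                parts.append("\n" if sent["newline_next"] else " ")
        paragraphs[par_id] = "".join(parts)
    return paragraphs


# Invented toy rows, not real dataset content.
rows = [
    {"par_id": "P1", "src_text": "First line.", "newline_next": True},
    {"par_id": "P1", "src_text": "Second sentence.", "newline_next": False},
    {"par_id": "P1", "src_text": "Third sentence.", "newline_next": False},
]
print(rebuild_paragraphs(rows)["P1"])
```

Passing text_field="tgt_text" rebuilds the English side of the same paragraphs.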

Languages

The language codes used in the dataset always consist of the three-letter ISO 639-3 language code (e.g. "eng") and the four-letter ISO 15924 writing system code (e.g. "Latn"). Optionally, they may also include an 8-character Glottolog code to define a more fine-grained dialect (as of now, this is applied only for Brazilian Portuguese).
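
A minimal sketch of splitting such a code into its components is given below. It assumes that the parts are separated by underscores and that the optional Glottolog code, when present, is appended as a third part; this separator convention is an assumption and should be checked against the actual data. The names LanguageCode and parse_language_code are hypothetical, and the Glottolog value in the example is a placeholder.

```python
from typing import NamedTuple, Optional


class LanguageCode(NamedTuple):
    iso_639_3: str              # three-letter language code, e.g. "eng"
    iso_15924: str              # four-letter writing system code, e.g. "Latn"
    glottocode: Optional[str]   # optional 8-character Glottolog code


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as "hin_Deva" into its components.

    Assumption: an optional Glottolog code is appended as a third
    underscore-separated part.
    """
    parts = code.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected number of parts in {code!r}")
    lang, script = parts[0], parts[1]
    if len(lang) != 3 or len(script) != 4:
        raise ValueError(f"malformed language or script code in {code!r}")
    glottocode = parts[2] if len(parts) == 3 else None
    if glottocode is not None and len(glottocode) != 8:
        raise ValueError(f"malformed Glottolog code in {code!r}")
    return LanguageCode(lang, script, glottocode)


print(parse_language_code("hin_Deva"))
# "abcd1234" is a placeholder, not a real Glottolog code.
print(parse_language_code("por_Latn_abcd1234"))
```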

Currently, BOUQuET covers 275 language varieties (more will be added later): 8 source ("pivot") languages + English + 266 added languoids (including two donated by the community).

List of languages

Code	ISO 639-3	ISO 15924	Language	Family	Comment
aar_Latn	aar	Latn	Afar	Afro-Asiatic
abl_Latn	abl	Latn	Lampung Nyo	Austronesian
afr_Latn	afr	Latn	Afrikaans	Indo-European
agr_Latn	agr	Latn	Aguaruna	Chicham
aiq_Arab	aiq	Arab	Aimaq	Indo-European
als_Latn	als	Latn	Tosk Albanian	Indo-European
amh_Ethi	amh	Ethi	Amharic	Afro-Asiatic
ami_Latn	ami	Latn	Amis	Austronesian
ane_Latn	ane	Latn	Xârâcùù	Austronesian
apc_Arab	apc	Arab	Levantine Arabic	Afro-Asiatic
arh_Latn	arh	Latn	Arhuaco	Chibchan
arn_Latn	arn	Latn	Mapudungun	Araucanian
arz_Arab	arz	Arab	Egyptian Arabic	Afro-Asiatic	Code-switching with Modern Standard Arabic (arb_Arab)
arz_Latn	arz	Latn	Egyptian Arabic (Romanized)	Afro-Asiatic	Code-switching with Modern Standard Arabic (arb_Latn)
asm_Beng	asm	Beng	Assamese	Indo-European
ayr_Latn	ayr	Latn	Central Aymara	Aymaran
ayz_Latn	ayz	Latn	Mai Brat	Maybratic
azb_Arab	azb	Arab	South Azerbaijani	Turkic
azj_Latn	azj	Latn	North Azerbaijani	Turkic
azm_Latn	azm	Latn	Ipalapa Amuzgo	Otomanguean
azz_Latn	azz	Latn	Highland Puebla Nahuatl	Uto-Aztecan
bak_Cyrl	bak	Cyrl	Bashkir	Turkic	Community-contributed
bam_Latn	bam	Latn	Bambara	Mande
bas_Latn	bas	Latn	Basaa	Atlantic-Congo
bba_Latn	bba	Latn	Baatonum	Atlantic-Congo
bel_Cyrl	bel	Cyrl	Belarusian	Indo-European
ben_Beng	ben	Beng	Bengali	Indo-European
ben_Latn	ben	Latn	Bengali (Romanized)	Indo-European
bft_Arab	bft	Arab	Balti	Sino-Tibetan
bhb_Deva	bhb	Deva	Bhili	Indo-European
bho_Deva	bho	Deva	Bhojpuri	Indo-European
bod_Tibt	bod	Tibt	Tibetan	Sino-Tibetan
bos_Latn	bos	Latn	Bosnian	Indo-European
bre_Latn	bre	Latn	Breton	Indo-European
brh_Arab	brh	Arab	Brahui	Dravidian
brx_Deva	brx	Deva	Bodo (India)	Sino-Tibetan
bsh_Arab	bsh	Arab	Kateviri	Indo-European
bsk_Arab	bsk	Arab	Burushaski	Burushaski
bul_Cyrl	bul	Cyrl	Bulgarian	Indo-European
cak_Latn	cak	Latn	Kaqchikel	Mayan
cat_Latn	cat	Latn	Catalan	Indo-European
ceb_Latn	ceb	Latn	Cebuano	Austronesian
ces_Latn	ces	Latn	Czech	Indo-European
che_Cyrl	che	Cyrl	Chechen	Nakh-Daghestanian	Community-contributed
chr_Cher	chr	Cher	Cherokee	Iroquoian
chv_Cyrl	chv	Cyrl	Chuvash	Turkic
cja_Arab	cja	Arab	Western Cham	Austronesian
cjk_Latn	cjk	Latn	Chokwe	Atlantic-Congo
ckb_Arab	ckb	Arab	Sorani Kurdish	Indo-European
ckl_Latn	ckl	Latn	Kibaku	Afro-Asiatic
cmn_Hans	cmn	Hans	Mandarin (Simplified)	Sino-Tibetan
cmn_Hant	cmn	Hant	Mandarin (Traditional)	Sino-Tibetan
crk_Cans	crk	Cans	Plains Cree	Algic
crk_Latn	crk	Latn	Plains Cree	Algic
cux_Latn	cux	Latn	Tepeuxila Cuicatec	Otomanguean
cym_Latn	cym	Latn	Welsh	Indo-European
dan_Latn	dan	Latn	Danish	Indo-European
daq_Deva	daq	Deva	Dandami Maria	Dravidian
deu_Latn	deu	Latn	German	Indo-European
dgo_Deva	dgo	Deva	Dogri	Indo-European
dik_Latn	dik	Latn	Southwestern Dinka	Nilotic
diq_Latn	diq	Latn	Zazaki - Southern Zaza	Indo-European
div_Thaa	div	Thaa	Dhivehi	Indo-European
djc_Latn	djc	Latn	Dar Daju	Dajuic
dje_Latn	dje	Latn	Zarma	Songhay
dtm_Latn	dtm	Latn	Tomo Kan Dogon	Dogon
dts_Latn	dts	Latn	Toro So Dogon	Dogon
dua_Latn	dua	Latn	Duala	Atlantic-Congo
dzo_Tibt	dzo	Tibt	Dzongkha	Sino-Tibetan
ekk_Latn	ekk	Latn	Standard Estonian	Uralic
ell_Grek	ell	Grek	Modern Greek	Indo-European
enb_Latn	enb	Latn	Markweeta	Nilotic
eng_Latn	eng	Latn	English	Indo-European
enl_Latn	enl	Latn	Enlhet	Lengua-Mascoy
eto_Latn	eto	Latn	Eton	Atlantic-Congo
eus_Latn	eus	Latn	Basque	Basque
ewo_Latn	ewo	Latn	Ewondo	Atlantic-Congo
fao_Latn	fao	Latn	Faroese	Indo-European
fia_Copt	fia	Copt	Nobiin	Nubian
fin_Latn	fin	Latn	Finnish	Uralic
fra_Latn	fra	Latn	French	Indo-European
fry_Latn	fry	Latn	Western Frisian	Indo-European
fuc_Latn	fuc	Latn	Pulaar	Atlantic-Congo
fuv_Latn	fuv	Latn	Nigerian Fulfulde	Atlantic-Congo
fvr_Latn	fvr	Latn	Fur	Furan
gax_Latn	gax	Latn	Borana-Arsi-Guji Oromo	Afro-Asiatic
gaz_Latn	gaz	Latn	West Central Oromo	Afro-Asiatic
gil_Latn	gil	Latn	Gilbertese	Austronesian
gkp_Latn	gkp	Latn	Kpelle (Guinea)	Mande
gla_Latn	gla	Latn	Scottish Gaelic	Indo-European
gle_Latn	gle	Latn	Irish	Indo-European
glg_Latn	glg	Latn	Galician	Indo-European
gom_Deva	gom	Deva	Goan Konkani	Indo-European
guc_Latn	guc	Latn	Wayuu	Arawakan
gug_Latn	gug	Latn	Paraguayan Guarani	Tupian
guj_Gujr	guj	Gujr	Gujarati	Indo-European
guz_Latn	guz	Latn	Gusii	Atlantic-Congo
gxx_Latn	gxx	Latn	Southern Wè	Kru
hat_Latn	hat	Latn	Haitian Creole	Indo-European
hau_Latn	hau	Latn	Hausa	Afro-Asiatic
heb_Hebr	heb	Hebr	Hebrew	Afro-Asiatic
heh_Latn	heh	Latn	Hehe	Atlantic-Congo
hin_Deva	hin	Deva	Hindi	Indo-European
hin_Latn	hin	Latn	Hindi (Romanized)	Indo-European
hne_Deva	hne	Deva	Chhattisgarhi	Indo-European
hrv_Latn	hrv	Latn	Croatian	Indo-European
hun_Latn	hun	Latn	Hungarian	Uralic
hve_Latn	hve	Latn	San Dionisio del Mar Huave	Huavean
hye_Armn	hye	Armn	Armenian	Indo-European
ibo_Latn	ibo	Latn	Igbo	Atlantic-Congo
ijc_Latn	ijc	Latn	Izon	Ijoid
ilo_Latn	ilo	Latn	Iloko	Austronesian
ind_Latn	ind	Latn	Indonesian	Austronesian
irk_Latn	irk	Latn	Iraqw	Afro-Asiatic
isl_Latn	isl	Latn	Icelandic	Indo-European
ita_Latn	ita	Latn	Italian	Indo-European
jav_Latn	jav	Latn	Javanese	Austronesian
jmc_Latn	jmc	Latn	Machame	Atlantic-Congo
jnj_Latn	jnj	Latn	Yemsa	Ta-Ne-Omotic
jpn_Jpan	jpn	Jpan	Japanese	Japonic
kaa_Cyrl	kaa	Cyrl	Karakalpak	Turkic
kac_Latn	kac	Latn	Kachin	Sino-Tibetan
kai_Latn	kai	Latn	Karekare	Afro-Asiatic
kal_Latn	kal	Latn	Kalaallisut	Eskimo-Aleut
kam_Latn	kam	Latn	Kamba	Atlantic-Congo
kan_Knda	kan	Knda	Kannada	Dravidian
kat_Geor	kat	Geor	Georgian	Kartvelian
kaz_Cyrl	kaz	Cyrl	Kazakh	Turkic
kea_Latn	kea	Latn	Kabuverdianu	Indo-European
kek_Latn	kek	Latn	Kekchí	Mayan
khk_Cyrl	khk	Cyrl	Halh Mongolian	Mongolic-Khitan
khm_Khmr	khm	Khmr	Central Khmer	Austroasiatic
khq_Latn	khq	Latn	Koyra Chiini Songhay	Songhay
khw_Arab	khw	Arab	Khowar	Indo-European
kin_Latn	kin	Latn	Kinyarwanda	Atlantic-Congo
kir_Cyrl	kir	Cyrl	Kyrgyz	Turkic
kls_Arab	kls	Arab	Kalasha	Indo-European
kmb_Latn	kmb	Latn	Kimbundu	Atlantic-Congo
kmr_Latn	kmr	Latn	Kurmanji Kurdish	Indo-European
knc_Arab	knc	Arab	Central Kanuri	Saharan
knw_Latn	knw	Latn	Kung-Ekoka	Kxa
kor_Kore	kor	Kore	Korean	Koreanic
krt_Latn	krt	Latn	Tumari Kanuri	Saharan
kru_Deva	kru	Deva	Kurukh	Dravidian
ksf_Latn	ksf	Latn	Bafia	Atlantic-Congo
ktu_Latn	ktu	Latn	Kituba	Atlantic-Congo
kuj_Latn	kuj	Latn	Kuria	Atlantic-Congo
kwy_Latn	kwy	Latn	San Salvador Kongo	Atlantic-Congo
kxp_Arab	kxp	Arab	Koli Wadiyari	Indo-European
lao_Laoo	lao	Laoo	Lao	Tai-Kadai
led_Latn	led	Latn	Lendu	Central Sudanic
lgg_Latn	lgg	Latn	Lugbara	Central Sudanic
lij_Latn	lij	Latn	Ligurian	Indo-European
lim_Latn	lim	Latn	Limburgish	Indo-European
lin_Latn	lin	Latn	Kinshasa Lingala	Atlantic-Congo
lir_Latn	lir	Latn	Liberian Kreyol	Pidgin
lit_Latn	lit	Latn	Lithuanian	Indo-European
loa_Latn	loa	Latn	Loloda	North Halmahera
loh_Latn	loh	Latn	Narim	Surmic
lug_Latn	lug	Latn	Ganda	Atlantic-Congo
luo_Latn	luo	Latn	Luo	Nilotic
lvs_Latn	lvs	Latn	Standard Latvian	Indo-European
maf_Latn	maf	Latn	Mafa	Afro-Asiatic
mai_Deva	mai	Deva	Maithili	Indo-European
mal_Mlym	mal	Mlym	Malayalam	Dravidian
mam_Latn	mam	Latn	Mam	Mayan
mar_Deva	mar	Deva	Marathi	Indo-European
mas_Latn	mas	Latn	Masai	Nilotic
mey_Latn	mey	Latn	Hassaniyya Arabic	Afro-Asiatic
mie_Latn	mie	Latn	Ocotepec Mixtec	Otomanguean
min_Arab	min	Arab	Minangkabau	Austronesian
miq_Latn	miq	Latn	Miskito	Misumalpan
mkd_Cyrl	mkd	Cyrl	Macedonian	Indo-European
mlt_Latn	mlt	Latn	Maltese	Afro-Asiatic
mos_Latn	mos	Latn	Mossi	Atlantic-Congo
mri_Latn	mri	Latn	Māori	Austronesian
mtq_Latn	mtq	Latn	Muong	Austroasiatic
mya_Mymr	mya	Mymr	Burmese	Sino-Tibetan
mzl_Latn	mzl	Latn	Mazatlán Mixe	Mixe-Zoque
naq_Latn	naq	Latn	Nama	Khoe-Kwadi
nhe_Latn	nhe	Latn	Eastern Huasteca Nahuatl	Uto-Aztecan
nld_Latn	nld	Latn	Standard Dutch	Indo-European
nlv_Latn	nlv	Latn	Orizaba Nahuatl	Uto-Aztecan
nno_Latn	nno	Latn	Nynorsk	Indo-European
npi_Deva	npi	Deva	Nepali	Indo-European
nso_Latn	nso	Latn	Northern Sotho	Atlantic-Congo
nus_Latn	nus	Latn	Nuer	Nilotic
nya_Latn	nya	Latn	Nyanja	Atlantic-Congo
ory_Orya	ory	Orya	Oriya	Indo-European
pbs_Latn	pbs	Latn	Central Pame	Otomanguean
pbt_Arab	pbt	Arab	Southern Pashto	Indo-European
pcm_Latn	pcm	Latn	Nigerian Pidgin	Indo-European
pes_Arab	pes	Arab	Western Persian	Indo-European
plt_Latn	plt	Latn	Plateau Malagasy	Austronesian
pnb_Guru	pnb	Guru	Western Punjabi	Indo-European
pol_Latn	pol	Latn	Polish	Indo-European
por_Latn	por	Latn	Brazilian Portuguese	Indo-European
quc_Latn	quc	Latn	K'iche'	Mayan
quh_Latn	quh	Latn	South Bolivian Quechua	Quechuan
quz_Latn	quz	Latn	Cusco Quechua	Quechuan
rob_Latn	rob	Latn	Tae’	Austronesian
roh_Latn	roh	Latn	Romansh	Indo-European
ron_Latn	ron	Latn	Romanian	Indo-European
rus_Cyrl	rus	Cyrl	Russian	Indo-European
sat_Olck	sat	Olck	Santali	Austroasiatic
sba_Latn	sba	Latn	Ngambay	Central Sudanic
scn_Latn	scn	Latn	Sicilian	Indo-European
sgc_Latn	sgc	Latn	Kipsigis	Nilotic
shn_Mymr	shn	Mymr	Shan	Tai-Kadai
sif_Latn	sif	Latn	Siamou	Siamou
sin_Sinh	sin	Sinh	Sinhala	Indo-European
skr_Arab	skr	Arab	Saraiki	Indo-European
slk_Latn	slk	Latn	Slovak	Indo-European
slv_Latn	slv	Latn	Slovene	Indo-European
sme_Latn	sme	Latn	Northern Sami	Uralic
sna_Latn	sna	Latn	Shona	Atlantic-Congo
snd_Arab	snd	Arab	Sindhi	Indo-European
som_Latn	som	Latn	Somali	Afro-Asiatic
sot_Latn	sot	Latn	Southern Sotho	Atlantic-Congo
spa_Latn	spa	Latn	Spanish	Indo-European
sro_Latn	sro	Latn	Sardinian Campidanese	Indo-European
srp_Cyrl	srp	Cyrl	Serbian	Indo-European
ssw_Latn	ssw	Latn	Swati	Atlantic-Congo
sun_Latn	sun	Latn	Sundanese	Austronesian
swe_Latn	swe	Latn	Swedish	Indo-European
swh_Latn	swh	Latn	Swahili	Atlantic-Congo
szl_Latn	szl	Latn	Silesian	Indo-European
tam_Latn	tam	Latn	Tamil (Romanized)	Dravidian
tam_Taml	tam	Taml	Tamil	Dravidian
taq_Latn	taq	Latn	Tamashek (Romanized)	Afro-Asiatic
taq_Tfng	taq	Tfng	Tamashek	Afro-Asiatic
tat_Cyrl	tat	Cyrl	Tatar	Turkic
tda_Latn	tda	Latn	Tagdal	Songhay
tel_Latn	tel	Latn	Telugu (Romanized)	Dravidian
tel_Telu	tel	Telu	Telugu	Dravidian
tgk_Cyrl	tgk	Cyrl	Tajik	Indo-European
tgl_Latn	tgl	Latn	Tagalog	Austronesian
tha_Thai	tha	Thai	Thai	Tai-Kadai
tir_Ethi	tir	Ethi	Tigrinya	Afro-Asiatic
toc_Latn	toc	Latn	Coyutla Totonac	Totonacan
tpi_Latn	tpi	Latn	Tok Pisin	Indo-European
tpl_Latn	tpl	Latn	Tlacoapa Me’phaa	Otomanguean
tsg_Latn	tsg	Latn	Tausug	Austronesian
tsn_Latn	tsn	Latn	Tswana	Atlantic-Congo
tso_Latn	tso	Latn	Tsonga	Atlantic-Congo
tsz_Latn	tsz	Latn	Purepecha	Tarascan
tui_Latn	tui	Latn	Tupuri	Atlantic-Congo
tur_Latn	tur	Latn	Turkish	Turkic
twi_Latn	twi	Latn	Twi	Atlantic-Congo
tzh_Latn	tzh	Latn	Tzeltal	Mayan
tzm_Tfng	tzm	Tfng	Central Atlas Tamazight	Afro-Asiatic
uig_Arab	uig	Arab	Uyghur	Turkic
ukr_Cyrl	ukr	Cyrl	Ukrainian	Indo-European
umb_Latn	umb	Latn	Umbundu	Atlantic-Congo
urd_Arab	urd	Arab	Urdu	Indo-European
urd_Latn	urd	Latn	Urdu (Romanized)	Indo-European
uzn_Latn	uzn	Latn	Northern Uzbek	Turkic
ven_Latn	ven	Latn	Venda	Atlantic-Congo
vie_Latn	vie	Latn	Vietnamese	Austroasiatic
vmw_Latn	vmw	Latn	Makhuwa	Atlantic-Congo
war_Latn	war	Latn	Waray	Austronesian
wlv_Latn	wlv	Latn	Bermejo Wichí	Mataguayan
wol_Latn	wol	Latn	Wolof	Atlantic-Congo
wuu_Hans	wuu	Hans	Wu Chinese	Sino-Tibetan
xho_Latn	xho	Latn	Xhosa	Atlantic-Congo
xuu_Latn	xuu	Latn	Khwedam	Khoe-Kwadi
ydd_Hebr	ydd	Hebr	Eastern Yiddish	Indo-European
ydg_Arab	ydg	Arab	Yadgha	Indo-European
yor_Latn	yor	Latn	Yoruba	Atlantic-Congo
yua_Latn	yua	Latn	Yucateco	Mayan
yue_Hant	yue	Hant	Yue Chinese	Sino-Tibetan
zai_Latn	zai	Latn	Isthmus Zapotec	Otomanguean
zsm_Latn	zsm	Latn	Colloquial Malay	Austronesian	Code-switching with Standard Malay (zlm_Latn)
zne_Latn	zne	Latn	Zande	Atlantic-Congo
zul_Latn	zul	Latn	Zulu	Atlantic-Congo

Dataset Creation

The base dataset has been created manually from scratch by professional linguists, by composing the source sentences that cover a variety of domains and registers in 8 diverse non-English languages: Egyptian Arabic (alternating with Modern Standard Arabic when appropriate), French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish.

For each of the source languages, the sentences have been created in the following 8 domains:

How-to, written tutorials or instructions
Conversations (dialogues)
Narration (creative writing that doesn’t include dialogues)
Social media posts
Social media comments (reactive)
Other web content
Reflective piece
Miscellaneous (address to a nation, disaster response, etc.)

Apart from the domains, a variety of registers (contextual styles) were used. Each sentence is annotated with the register characterized by three features: connectedness, preparedness, and social differential.

The linguists who were creating the dataset were instructed to maintain the diversity of sentence lengths, word orders, sentence structures, and other linguistic characteristics.

Subsequently, the source sentences were translated from the 8 source languages into English, and then from English into the other languages.

See the paper for more details.

Contribution

To contribute to the dataset (adding translations for a new language, or verifying some of the existing translations), please use the web annotation tool at https://bouquet.metademolab.com. Please contact us in case of any questions!

Citation

If you are referring to this dataset, please cite the BOUQuET paper and the Omnilingual MT paper.

@inproceedings{andrews-etal-2025-bouquet,
    title = "{BOUQ}u{ET} : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation",
    author = "Andrews, Pierre  and
      Artetxe, Mikel  and
      Meglioli, Mariano Coria  and
      Costa-juss{\`a}, Marta R.  and
      Chuang, Joe  and
      Dale, David  and
      Duppenthaler, Mark  and
      Ekberg, Nathanial Paul  and
      Gao, Cynthia  and
      Licht, Daniel Edward  and
      Maillard, Jean  and
      Mourachko, Alexandre  and
      Ropers, Christophe  and
      Saleem, Safiyyah  and
      S{\'a}nchez, Eduardo  and
      Tsiamas, Ioannis  and
      Turkatenko, Arina  and
      Ventayol-Boada, Albert  and
      Yates, Shireen",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1400/",
    doi = "10.18653/v1/2025.emnlp-main.1400",
    pages = "27515--27535",
    ISBN = "979-8-89176-332-6",
}

@misc{omnilingual2026,
    title={Omnilingual {MT}: Machine Translation for 1,600 Languages},
    author={The Omnilingual MT Team and Belen Alastruey and Niyati Bafna and Andrea Caciolai and Kevin Heffernan and Artyom Kozhevnikov and Christophe Ropers and Eduardo S{\'a}nchez and Charles-Eric Saint-James and Ioannis Tsiamas and Chierh Cheng and Joe Chuang and Paul-Ambroise Duquenne and Mark Duppenthaler and Nate Ekberg and Cynthia Gao and Pere Llu{\'i}s Huguet Cabot and Jo{\~a}o Maria Janeiro and Jean Maillard and Gabriel Mejia Gonzalez and Holger Schwenk and Edan Toledo and Arina Turkatenko and Albert Ventayol-Boada and Rashel Moritz and Alexandre Mourachko and Surya Parimi and Mary Williamson and Shireen Yates and David Dale and Marta R. Costa-juss{\`a}},
    year={2026},
    eprint={},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2603.16309},
}

Glossary

Domain. By the term domain, we mean different spaces in which language is produced in speech, sign, or writing (e.g., books, social media, news, Wikipedia, organization websites, official documents, direct messaging, texting). In this dataset, we focus solely on the written modality.
Register. We understand the term register as a functional variety of language that includes socio-semiotic properties, as expressed in [Halliday and Matthiessen (2004)], or more simply as a "contextual style", as presented in [Labov (1991), pp.79–99]. In that regard, a register is a specific variety of language used to best fit a specific communicative purpose in a specific situation.
