Task: MT
Release Date: 7/3/2026
Format: TSV
Size: 50.65 MB
BOUQuET is a multi-way parallel, multi-centric and multi-register/domain dataset and benchmark for machine translation quality, developed by the Omnilingual team at FAIR (Meta). The underlying texts (318 paragraphs consisting of 1358 sentences in the publicly available subset) have been handcrafted by linguists in 8 diverse languages (Egyptian Arabic, French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish) and translated to English and 266 other language varieties (language + script + dialect combinations). The dataset is intended to be extensible to virtually any other written language. Volunteers can contribute new translations via https://bouquet.metademolab.com.
Licensing
Creative Commons Attribution 4.0 International (CC-BY-4.0)
https://spdx.org/licenses/CC-BY-4.0.html
Restrictions/Special Constraints
This data is for evaluation purposes only; You may not use any of this data or its derivatives for training machine learning / AI models. You may only distribute, embed, or otherwise transfer this data or its derivatives via a mechanism that is either private or that implements protections against automated crawling (such as using a password-protected archive or a gating mechanism that requires users to accept these terms before accessing the dataset). Your distributions must retain these terms.
Forbidden Usage
This data is for evaluation purposes only; You may not use any of this data or its derivatives for training machine learning / AI models.
Intended Use
Benchmarking the quality of (potentially massively multilingual) machine translation
An interface for contributing new volunteer translations: https://bouquet.metademolab.com
A paper by Omnilingual Team, 2025 with the original design of the dataset: https://arxiv.org/abs/2502.04314
A paper by Omnilingual Team, 2026 with its extension to more languages: https://arxiv.org/abs/2603.16309
A leaderboard of machine translation systems based on BOUQuET: https://huggingface.co/spaces/facebook/bouquet
Code for evaluating new models: https://github.com/facebookresearch/bouquet
A mirror version of the dataset, with additional configurations and Met-BOUQuET data (human judgements of machine translation quality): https://huggingface.co/datasets/facebook/bouquet
The dataset is intended for evaluation of machine translation quality. Its purpose is similar to that of FLORES+ or WMT24++. Unlike these datasets, BOUQuET focuses more on linguistic diversity, both across languages (including some extremely low-resourced languages) and within a language (covering different registers).
The base BOUQuET dataset is not intended as a training dataset, but the dev subset may be used for validation during model development.
As default evaluation metrics, we recommend ChrF++ and MetricX. For difficult target languages, model-based metrics like MetricX should be adjusted with LID scores (e.g. by GlotLID) to penalize off-target translations. Our evaluation codebase contains the recommended implementations of the evaluation metrics.
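As an illustration of the LID adjustment, the sketch below shows one possible rule: a translation whose detected language differs from the target language receives the worst score. This is an assumption made for illustration only and is not necessarily the rule implemented in the evaluation codebase. The function name `lid_adjusted_score`, the `WORST_SCORE` constant, and the example values are hypothetical; the sketch also assumes an error-style metric where lower is better.

```python
# Assumption: the quality metric is an error score where lower is better
# (as with MetricX), and 25.0 is the worst value on its scale.
WORST_SCORE = 25.0


def lid_adjusted_score(metric_score, predicted_lang, target_lang,
                       worst_score=WORST_SCORE):
    """Return metric_score, or worst_score when the output is off-target.

    predicted_lang is the label a LID model assigned to the translation,
    already normalized to the dataset's code format (e.g. "hin_Deva").
    """
    if predicted_lang != target_lang:
        return worst_score
    return metric_score


# Hypothetical scores and LID predictions for two system outputs.
on_target = lid_adjusted_score(2.5, "hin_Deva", "hin_Deva")
off_target = lid_adjusted_score(1.0, "eng_Latn", "hin_Deva")
```

A softer variant could scale the penalty by the LID model's confidence instead of applying a hard cutoff.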
BOUQuET consists of short paragraphs, fully parallel in all languages at the sentence level.
The dataset is distributed both at the sentence level and at the paragraph level.
By default, data with both levels is loaded; the paragraph_level and sentence_level configs may be used to load the levels separately.
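Because the two levels carry the same texts, paragraph-level texts can be rebuilt from sentence-level rows. The sketch below is a minimal illustration, assuming each row is a dict with the `par_id` and `newline_next` fields from the field list, that `newline_next` is already a boolean, and that the rows belong to a single language and are in paragraph order. The helper name `rebuild_paragraphs` and the example sentences are hypothetical.

```python
from collections import defaultdict


def rebuild_paragraphs(rows, text_field="src_text"):
    """Join sentence-level rows into paragraph texts keyed by par_id.

    Rows are assumed to come from one language and to be in paragraph order.
    """
    parts = defaultdict(list)
    for row in rows:
        pieces = parts[row["par_id"]]
        pieces.append(row[text_field])
        # A newline or a space follows each sentence, as per newline_next.
        pieces.append("\n" if row["newline_next"] else " ")
    # Drop the separator after the last sentence of each paragraph.
    return {par_id: "".join(pieces[:-1]) for par_id, pieces in parts.items()}


# Made-up rows for illustration; real rows contain more fields.
example_rows = [
    {"par_id": "P464", "src_text": "Hello.", "newline_next": False},
    {"par_id": "P464", "src_text": "How are you?", "newline_next": True},
    {"par_id": "P464", "src_text": "Fine.", "newline_next": False},
]
paragraphs = rebuild_paragraphs(example_rows)
```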
The public portion of the dataset contains two splits:
dev: 504 unique sentences, 120 paragraphs
test: 854 unique sentences, 198 paragraphs
An additional split made up of 632 unique sentences and 144 paragraphs is being held out for quality assurance purposes and is not distributed here.
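To work with one split at a time, rows can be filtered on the `split` and `level` columns. The following sketch uses only the Python standard library and assumes that the TSV file has a header row with the column names from the field list; this layout is an assumption and should be checked against the actual file. The helper name `read_rows` and the sample rows are hypothetical.

```python
import csv
import io


def read_rows(tsv_file, split, level="sentence_level"):
    """Read rows of the given split and level from an open TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader
            if row["split"] == split and row["level"] == level]


# A tiny in-memory stand-in for the real file, with made-up content.
SAMPLE = (
    "level\tsplit\tuniq_id\tsrc_lang\tsrc_text\n"
    "sentence_level\tdev\tP001-S1\tfra_Latn\tBonjour.\n"
    "sentence_level\ttest\tP002-S1\tfra_Latn\tMerci.\n"
)
dev_rows = read_rows(io.StringIO(SAMPLE), split="dev")
```

For a file on disk, pass an object opened with `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` stand-in.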
A part of the BOUQuET dataset has been contributed by the Mozilla Foundation.
This part consists of translations into the following 16 language varieties: azz_Latn, bas_Latn, bft_Arab, brh_Arab, bsh_Arab, cak_Latn, chv_Cyrl, cux_Latn, dua_Latn, eto_Latn, kls_Arab, ksf_Latn, kxp_Arab, skr_Arab, tui_Latn, ydg_Arab.
We are extremely grateful to our Mozilla Foundation partners for this collaboration!
The BOUQuET dataset contains the following fields:
level (str): "sentence_level" (paragraph-level texts can be reconstructed by joining the sentences of each paragraph with a whitespace or a newline, as per the newline_next column)
split (str): "dev" or "test"
uniq_id (str): identifier of the dataset item (e.g. P464-S1 for sentence-level, P464 for paragraph-level data)
src_lang (str): NLLB-compatible non-English language code (such as hin_Deva)
tgt_lang (str): always eng_Latn (but because the data is multiway parallel, any two languages can be paired as source and target instead)
src_text (str): non-English text
tgt_text (str): English text
orig_text (str): the original text (sentence or paragraph), which sometimes corresponds to src_text
par_comment (str): comment on the whole paragraph
newline_next (bool): whether the sentence should be followed by a newline in the paragraph
par_id (str): paragraph id (e.g. P464)
domain (str): one of the 8 domains (see their list below)
register (str): three-letter identifier of the register of the original sentence (see the paper for the explanation of their meaning)
tags (str): comma-separated linguistic tags of a sentence (see the paper for more details)
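Since every row pairs a non-English text with its English counterpart under a shared `uniq_id`, two non-English languages can be aligned by joining on that identifier. The sketch below is one way to do this, assuming the rows are dicts with the fields listed above; the helper names `texts_by_id` and `pair_languages` and the example sentences are hypothetical.

```python
def texts_by_id(rows, lang):
    """Map uniq_id to the text in the requested language."""
    texts = {}
    for row in rows:
        if lang == row["tgt_lang"]:
            # English lives in tgt_text of every row.
            texts[row["uniq_id"]] = row["tgt_text"]
        elif lang == row["src_lang"]:
            texts[row["uniq_id"]] = row["src_text"]
    return texts


def pair_languages(rows, src_lang, tgt_lang):
    """Return (uniq_id, source text, target text) for items present in both."""
    src = texts_by_id(rows, src_lang)
    tgt = texts_by_id(rows, tgt_lang)
    return [(uid, src[uid], tgt[uid]) for uid in src if uid in tgt]


# Made-up rows for illustration.
example_rows = [
    {"uniq_id": "P001-S1", "src_lang": "fra_Latn", "tgt_lang": "eng_Latn",
     "src_text": "Bonjour.", "tgt_text": "Hello."},
    {"uniq_id": "P001-S1", "src_lang": "deu_Latn", "tgt_lang": "eng_Latn",
     "src_text": "Guten Tag.", "tgt_text": "Hello."},
]
fra_deu = pair_languages(example_rows, "fra_Latn", "deu_Latn")
eng_fra = pair_languages(example_rows, "eng_Latn", "fra_Latn")
```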
The language codes used in the dataset always consist of the three-letter ISO 639-3 language code (e.g. "eng") and the four-letter ISO 15924 writing system code (e.g. "Latn"). Optionally, they may also include an 8-character Glottolog code to define a more fine-grained dialect (as of now, this is applied only for Brazilian Portuguese).
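A code of this shape can be split into its components with a small parser. The sketch below assumes that the parts are joined with underscores and that the optional Glottolog code is appended as a third part; the separator for the third part is an assumption, since the description above does not specify it. The names `LangCode` and `parse_lang_code` are hypothetical, and `abcd1234` is a placeholder, not a real Glottolog code.

```python
import re
from typing import NamedTuple, Optional


class LangCode(NamedTuple):
    language: str              # ISO 639-3, e.g. "eng"
    script: str                # ISO 15924, e.g. "Latn"
    glottocode: Optional[str]  # optional 8-character Glottolog code


# Assumption: parts are joined by "_" and a Glottolog code is four
# alphanumeric characters followed by four digits.
_LANG_CODE = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})(?:_([a-z0-9]{4}[0-9]{4}))?")


def parse_lang_code(code):
    """Split a language code into language, script, and optional dialect."""
    match = _LANG_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"unexpected language code: {code!r}")
    return LangCode(*match.groups())


hindi = parse_lang_code("hin_Deva")
with_dialect = parse_lang_code("por_Latn_abcd1234")  # placeholder dialect code
```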
Currently, BOUQuET covers 275 language varieties (more will be added later): 8 source ("pivot") languages + English + 266 added languoids (including two donated by the community).
List of languages
| Code | ISO 639-3 | ISO 15924 | Language | Family | Comment |
|---|---|---|---|---|---|
| aar_Latn | aar | Latn | Afar | Afro-Asiatic | |
| abl_Latn | abl | Latn | Lampung Nyo | Austronesian | |
| afr_Latn | afr | Latn | Afrikaans | Indo-European | |
| agr_Latn | agr | Latn | Aguaruna | Chicham | |
| aiq_Arab | aiq | Arab | Aimaq | Indo-European | |
| als_Latn | als | Latn | Tosk Albanian | Indo-European | |
| amh_Ethi | amh | Ethi | Amharic | Afro-Asiatic | |
| ami_Latn | ami | Latn | Amis | Austronesian | |
| ane_Latn | ane | Latn | Xârâcùù | Austronesian | |
| apc_Arab | apc | Arab | Levantine Arabic | Afro-Asiatic | |
| arh_Latn | arh | Latn | Arhuaco | Chibchan | |
| arn_Latn | arn | Latn | Mapudungun | Araucanian | |
| arz_Arab | arz | Arab | Egyptian Arabic | Afro-Asiatic | Code-switching with Modern Standard Arabic (arb_Arab) |
| arz_Latn | arz | Latn | Egyptian Arabic (Romanized) | Afro-Asiatic | Code-switching with Modern Standard Arabic (arb_Latn) |
| asm_Beng | asm | Beng | Assamese | Indo-European | |
| ayr_Latn | ayr | Latn | Central Aymara | Aymaran | |
| ayz_Latn | ayz | Latn | Mai Brat | Maybratic | |
| azb_Arab | azb | Arab | South Azerbaijani | Turkic | |
| azj_Latn | azj | Latn | North Azerbaijani | Turkic | |
| azm_Latn | azm | Latn | Ipalapa Amuzgo | Otomanguean | |
| azz_Latn | azz | Latn | Highland Puebla Nahuatl | Uto-Aztecan | |
| bak_Cyrl | bak | Cyrl | Bashkir | Turkic | Community-contributed |
| bam_Latn | bam | Latn | Bambara | Mande | |
| bas_Latn | bas | Latn | Basaa | Atlantic-Congo | |
| bba_Latn | bba | Latn | Baatonum | Atlantic-Congo | |
| bel_Cyrl | bel | Cyrl | Belarusian | Indo-European | |
| ben_Beng | ben | Beng | Bengali | Indo-European | |
| ben_Latn | ben | Latn | Bengali (Romanized) | Indo-European | |
| bft_Arab | bft | Arab | Balti | Sino-Tibetan | |
| bhb_Deva | bhb | Deva | Bhili | Indo-European | |
| bho_Deva | bho | Deva | Bhojpuri | Indo-European | |
| bod_Tibt | bod | Tibt | Tibetan | Sino-Tibetan | |
| bos_Latn | bos | Latn | Bosnian | Indo-European | |
| bre_Latn | bre | Latn | Breton | Indo-European | |
| brh_Arab | brh | Arab | Brahui | Dravidian | |
| brx_Deva | brx | Deva | Bodo (India) | Sino-Tibetan | |
| bsh_Arab | bsh | Arab | Kateviri | Indo-European | |
| bsk_Arab | bsk | Arab | Burushaski | Burushaski | |
| bul_Cyrl | bul | Cyrl | Bulgarian | Indo-European | |
| cak_Latn | cak | Latn | Kaqchikel | Mayan | |
| cat_Latn | cat | Latn | Catalan | Indo-European | |
| ceb_Latn | ceb | Latn | Cebuano | Austronesian | |
| ces_Latn | ces | Latn | Czech | Indo-European | |
| che_Cyrl | che | Cyrl | Chechen | Nakh-Daghestanian | Community-contributed |
| chr_Cher | chr | Cher | Cherokee | Iroquoian | |
| chv_Cyrl | chv | Cyrl | Chuvash | Turkic | |
| cja_Arab | cja | Arab | Western Cham | Austronesian | |
| cjk_Latn | cjk | Latn | Chokwe | Atlantic-Congo | |
| ckb_Arab | ckb | Arab | Sorani Kurdish | Indo-European | |
| ckl_Latn | ckl | Latn | Kibaku | Afro-Asiatic | |
| cmn_Hans | cmn | Hans | Mandarin (Simplified) | Sino-Tibetan | |
| cmn_Hant | cmn | Hant | Mandarin (Traditional) | Sino-Tibetan | |
| crk_Cans | crk | Cans | Plains Cree | Algic | |
| crk_Latn | crk | Latn | Plains Cree | Algic | |
| cux_Latn | cux | Latn | Tepeuxila Cuicatec | Otomanguean | |
| cym_Latn | cym | Latn | Welsh | Indo-European | |
| dan_Latn | dan | Latn | Danish | Indo-European | |
| daq_Deva | daq | Deva | Dandami Maria | Dravidian | |
| deu_Latn | deu | Latn | German | Indo-European | |
| dgo_Deva | dgo | Deva | Dogri | Indo-European | |
| dik_Latn | dik | Latn | Southwestern Dinka | Nilotic | |
| diq_Latn | diq | Latn | Zazaki - Southern Zaza | Indo-European | |
| div_Thaa | div | Thaa | Dhivehi | Indo-European | |
| djc_Latn | djc | Latn | Dar Daju | Dajuic | |
| dje_Latn | dje | Latn | Zarma | Songhay | |
| dtm_Latn | dtm | Latn | Tomo Kan Dogon | Dogon | |
| dts_Latn | dts | Latn | Toro So Dogon | Dogon | |
| dua_Latn | dua | Latn | Duala | Atlantic-Congo | |
| dzo_Tibt | dzo | Tibt | Dzongkha | Sino-Tibetan | |
| ekk_Latn | ekk | Latn | Standard Estonian | Uralic | |
| ell_Grek | ell | Grek | Modern Greek | Indo-European | |
| enb_Latn | enb | Latn | Markweeta | Nilotic | |
| eng_Latn | eng | Latn | English | Indo-European | |
| enl_Latn | enl | Latn | Enlhet | Lengua-Mascoy | |
| eto_Latn | eto | Latn | Eton | Atlantic-Congo | |
| eus_Latn | eus | Latn | Basque | Basque | |
| ewo_Latn | ewo | Latn | Ewondo | Atlantic-Congo | |
| fao_Latn | fao | Latn | Faroese | Indo-European | |
| fia_Copt | fia | Copt | Nobiin | Nubian | |
| fin_Latn | fin | Latn | Finnish | Uralic | |
| fra_Latn | fra | Latn | French | Indo-European | |
| fry_Latn | fry | Latn | Western Frisian | Indo-European | |
| fuc_Latn | fuc | Latn | Pulaar | Atlantic-Congo | |
| fuv_Latn | fuv | Latn | Nigerian Fulfulde | Atlantic-Congo | |
| fvr_Latn | fvr | Latn | Fur | Furan | |
| gax_Latn | gax | Latn | Borana-Arsi-Guji Oromo | Afro-Asiatic | |
| gaz_Latn | gaz | Latn | West Central Oromo | Afro-Asiatic | |
| gil_Latn | gil | Latn | Gilbertese | Austronesian | |
| gkp_Latn | gkp | Latn | Kpelle (Guinea) | Mande | |
| gla_Latn | gla | Latn | Scottish Gaelic | Indo-European | |
| gle_Latn | gle | Latn | Irish | Indo-European | |
| glg_Latn | glg | Latn | Galician | Indo-European | |
| gom_Deva | gom | Deva | Goan Konkani | Indo-European | |
| guc_Latn | guc | Latn | Wayuu | Arawakan | |
| gug_Latn | gug | Latn | Paraguayan Guarani | Tupian | |
| guj_Gujr | guj | Gujr | Gujarati | Indo-European | |
| guz_Latn | guz | Latn | Gusii | Atlantic-Congo | |
| gxx_Latn | gxx | Latn | Southern Wè | Kru | |
| hat_Latn | hat | Latn | Haitian Creole | Indo-European | |
| hau_Latn | hau | Latn | Hausa | Afro-Asiatic | |
| heb_Hebr | heb | Hebr | Hebrew | Afro-Asiatic | |
| heh_Latn | heh | Latn | Hehe | Atlantic-Congo | |
| hin_Deva | hin | Deva | Hindi | Indo-European | |
| hin_Latn | hin | Latn | Hindi (Romanized) | Indo-European | |
| hne_Deva | hne | Deva | Chhattisgarhi | Indo-European | |
| hrv_Latn | hrv | Latn | Croatian | Indo-European | |
| hun_Latn | hun | Latn | Hungarian | Uralic | |
| hve_Latn | hve | Latn | San Dionisio del Mar Huave | Huavean | |
| hye_Armn | hye | Armn | Armenian | Indo-European | |
| ibo_Latn | ibo | Latn | Igbo | Atlantic-Congo | |
| ijc_Latn | ijc | Latn | Izon | Ijoid | |
| ilo_Latn | ilo | Latn | Iloko | Austronesian | |
| ind_Latn | ind | Latn | Indonesian | Austronesian | |
| irk_Latn | irk | Latn | Iraqw | Afro-Asiatic | |
| isl_Latn | isl | Latn | Icelandic | Indo-European | |
| ita_Latn | ita | Latn | Italian | Indo-European | |
| jav_Latn | jav | Latn | Javanese | Austronesian | |
| jmc_Latn | jmc | Latn | Machame | Atlantic-Congo | |
| jnj_Latn | jnj | Latn | Yemsa | Ta-Ne-Omotic | |
| jpn_Jpan | jpn | Jpan | Japanese | Japonic | |
| kaa_Cyrl | kaa | Cyrl | Karakalpak | Turkic | |
| kac_Latn | kac | Latn | Kachin | Sino-Tibetan | |
| kai_Latn | kai | Latn | Karekare | Afro-Asiatic | |
| kal_Latn | kal | Latn | Kalaallisut | Eskimo-Aleut | |
| kam_Latn | kam | Latn | Kamba | Atlantic-Congo | |
| kan_Knda | kan | Knda | Kannada | Dravidian | |
| kat_Geor | kat | Geor | Georgian | Kartvelian | |
| kaz_Cyrl | kaz | Cyrl | Kazakh | Turkic | |
| kea_Latn | kea | Latn | Kabuverdianu | Indo-European | |
| kek_Latn | kek | Latn | Kekchí | Mayan | |
| khk_Cyrl | khk | Cyrl | Halh Mongolian | Mongolic-Khitan | |
| khm_Khmr | khm | Khmr | Central Khmer | Austroasiatic | |
| khq_Latn | khq | Latn | Koyra Chiini Songhay | Songhay | |
| khw_Arab | khw | Arab | Khowar | Indo-European | |
| kin_Latn | kin | Latn | Kinyarwanda | Atlantic-Congo | |
| kir_Cyrl | kir | Cyrl | Kyrgyz | Turkic | |
| kls_Arab | kls | Arab | Kalasha | Indo-European | |
| kmb_Latn | kmb | Latn | Kimbundu | Atlantic-Congo | |
| kmr_Latn | kmr | Latn | Kurmanji Kurdish | Indo-European | |
| knc_Arab | knc | Arab | Central Kanuri | Saharan | |
| knw_Latn | knw | Latn | Kung-Ekoka | Kxa | |
| kor_Kore | kor | Kore | Korean | Koreanic | |
| krt_Latn | krt | Latn | Tumari Kanuri | Saharan | |
| kru_Deva | kru | Deva | Kurukh | Dravidian | |
| ksf_Latn | ksf | Latn | Bafia | Atlantic-Congo | |
| ktu_Latn | ktu | Latn | Kituba | Atlantic-Congo | |
| kuj_Latn | kuj | Latn | Kuria | Atlantic-Congo | |
| kwy_Latn | kwy | Latn | San Salvador Kongo | Atlantic-Congo | |
| kxp_Arab | kxp | Arab | Koli Wadiyari | Indo-European | |
| lao_Laoo | lao | Laoo | Lao | Tai-Kadai | |
| led_Latn | led | Latn | Lendu | Central Sudanic | |
| lgg_Latn | lgg | Latn | Lugbara | Central Sudanic | |
| lij_Latn | lij | Latn | Ligurian | Indo-European | |
| lim_Latn | lim | Latn | Limburgish | Indo-European | |
| lin_Latn | lin | Latn | Kinshasa Lingala | Atlantic-Congo | |
| lir_Latn | lir | Latn | Liberian Kreyol | Pidgin | |
| lit_Latn | lit | Latn | Lithuanian | Indo-European | |
| loa_Latn | loa | Latn | Loloda | North Halmahera | |
| loh_Latn | loh | Latn | Narim | Surmic | |
| lug_Latn | lug | Latn | Ganda | Atlantic-Congo | |
| luo_Latn | luo | Latn | Luo | Nilotic | |
| lvs_Latn | lvs | Latn | Standard Latvian | Indo-European | |
| maf_Latn | maf | Latn | Mafa | Afro-Asiatic | |
| mai_Deva | mai | Deva | Maithili | Indo-European | |
| mal_Mlym | mal | Mlym | Malayalam | Dravidian | |
| mam_Latn | mam | Latn | Mam | Mayan | |
| mar_Deva | mar | Deva | Marathi | Indo-European | |
| mas_Latn | mas | Latn | Masai | Nilotic | |
| mey_Latn | mey | Latn | Hassaniyya Arabic | Afro-Asiatic | |
| mie_Latn | mie | Latn | Ocotepec Mixtec | Otomanguean | |
| min_Arab | min | Arab | Minangkabau | Austronesian | |
| miq_Latn | miq | Latn | Miskito | Misumalpan | |
| mkd_Cyrl | mkd | Cyrl | Macedonian | Indo-European | |
| mlt_Latn | mlt | Latn | Maltese | Afro-Asiatic | |
| mos_Latn | mos | Latn | Mossi | Atlantic-Congo | |
| mri_Latn | mri | Latn | Māori | Austronesian | |
| mtq_Latn | mtq | Latn | Muong | Austroasiatic | |
| mya_Mymr | mya | Mymr | Burmese | Sino-Tibetan | |
| mzl_Latn | mzl | Latn | Mazatlán Mixe | Mixe-Zoque | |
| naq_Latn | naq | Latn | Nama | Khoe-Kwadi | |
| nhe_Latn | nhe | Latn | Eastern Huasteca Nahuatl | Uto-Aztecan | |
| nld_Latn | nld | Latn | Standard Dutch | Indo-European | |
| nlv_Latn | nlv | Latn | Orizaba Nahuatl | Uto-Aztecan | |
| nno_Latn | nno | Latn | Nynorsk | Indo-European | |
| npi_Deva | npi | Deva | Nepali | Indo-European | |
| nso_Latn | nso | Latn | Northern Sotho | Atlantic-Congo | |
| nus_Latn | nus | Latn | Nuer | Nilotic | |
| nya_Latn | nya | Latn | Nyanja | Atlantic-Congo | |
| ory_Orya | ory | Orya | Oriya | Indo-European | |
| pbs_Latn | pbs | Latn | Central Pame | Otomanguean | |
| pbt_Arab | pbt | Arab | Southern Pashto | Indo-European | |
| pcm_Latn | pcm | Latn | Nigerian Pidgin | Indo-European | |
| pes_Arab | pes | Arab | Western Persian | Indo-European | |
| plt_Latn | plt | Latn | Plateau Malagasy | Austronesian | |
| pnb_Guru | pnb | Guru | Western Punjabi | Indo-European | |
| pol_Latn | pol | Latn | Polish | Indo-European | |
| por_Latn | por | Latn | Brazilian Portuguese | Indo-European | |
| quc_Latn | quc | Latn | K'iche' | Mayan | |
| quh_Latn | quh | Latn | South Bolivian Quechua | Quechuan | |
| quz_Latn | quz | Latn | Cusco Quechua | Quechuan | |
| rob_Latn | rob | Latn | Tae’ | Austronesian | |
| roh_Latn | roh | Latn | Romansh | Indo-European | |
| ron_Latn | ron | Latn | Romanian | Indo-European | |
| rus_Cyrl | rus | Cyrl | Russian | Indo-European | |
| sat_Olck | sat | Olck | Santali | Austroasiatic | |
| sba_Latn | sba | Latn | Ngambay | Central Sudanic | |
| scn_Latn | scn | Latn | Sicilian | Indo-European | |
| sgc_Latn | sgc | Latn | Kipsigis | Nilotic | |
| shn_Mymr | shn | Mymr | Shan | Tai-Kadai | |
| sif_Latn | sif | Latn | Siamou | Siamou | |
| sin_Sinh | sin | Sinh | Sinhala | Indo-European | |
| skr_Arab | skr | Arab | Saraiki | Indo-European | |
| slk_Latn | slk | Latn | Slovak | Indo-European | |
| slv_Latn | slv | Latn | Slovene | Indo-European | |
| sme_Latn | sme | Latn | Northern Sami | Uralic | |
| sna_Latn | sna | Latn | Shona | Atlantic-Congo | |
| snd_Arab | snd | Arab | Sindhi | Indo-European | |
| som_Latn | som | Latn | Somali | Afro-Asiatic | |
| sot_Latn | sot | Latn | Southern Sotho | Atlantic-Congo | |
| spa_Latn | spa | Latn | Spanish | Indo-European | |
| sro_Latn | sro | Latn | Sardinian Campidanese | Indo-European | |
| srp_Cyrl | srp | Cyrl | Serbian | Indo-European | |
| ssw_Latn | ssw | Latn | Swati | Atlantic-Congo | |
| sun_Latn | sun | Latn | Sundanese | Austronesian | |
| swe_Latn | swe | Latn | Swedish | Indo-European | |
| swh_Latn | swh | Latn | Swahili | Atlantic-Congo | |
| szl_Latn | szl | Latn | Silesian | Indo-European | |
| tam_Latn | tam | Latn | Tamil (Romanized) | Dravidian | |
| tam_Taml | tam | Taml | Tamil | Dravidian | |
| taq_Latn | taq | Latn | Tamashek (Romanized) | Afro-Asiatic | |
| taq_Tfng | taq | Tfng | Tamashek | Afro-Asiatic | |
| tat_Cyrl | tat | Cyrl | Tatar | Turkic | |
| tda_Latn | tda | Latn | Tagdal | Songhay | |
| tel_Latn | tel | Latn | Telugu (Romanized) | Dravidian | |
| tel_Telu | tel | Telu | Telugu | Dravidian | |
| tgk_Cyrl | tgk | Cyrl | Tajik | Indo-European | |
| tgl_Latn | tgl | Latn | Tagalog | Austronesian | |
| tha_Thai | tha | Thai | Thai | Tai-Kadai | |
| tir_Ethi | tir | Ethi | Tigrinya | Afro-Asiatic | |
| toc_Latn | toc | Latn | Coyutla Totonac | Totonacan | |
| tpi_Latn | tpi | Latn | Tok Pisin | Indo-European | |
| tpl_Latn | tpl | Latn | Tlacoapa Me’phaa | Otomanguean | |
| tsg_Latn | tsg | Latn | Tausug | Austronesian | |
| tsn_Latn | tsn | Latn | Tswana | Atlantic-Congo | |
| tso_Latn | tso | Latn | Tsonga | Atlantic-Congo | |
| tsz_Latn | tsz | Latn | Purepecha | Tarascan | |
| tui_Latn | tui | Latn | Tupuri | Atlantic-Congo | |
| tur_Latn | tur | Latn | Turkish | Turkic | |
| twi_Latn | twi | Latn | Twi | Atlantic-Congo | |
| tzh_Latn | tzh | Latn | Tzeltal | Mayan | |
| tzm_Tfng | tzm | Tfng | Central Atlas Tamazight | Afro-Asiatic | |
| uig_Arab | uig | Arab | Uyghur | Turkic | |
| ukr_Cyrl | ukr | Cyrl | Ukrainian | Indo-European | |
| umb_Latn | umb | Latn | Umbundu | Atlantic-Congo | |
| urd_Arab | urd | Arab | Urdu | Indo-European | |
| urd_Latn | urd | Latn | Urdu (Romanized) | Indo-European | |
| uzn_Latn | uzn | Latn | Northern Uzbek | Turkic | |
| ven_Latn | ven | Latn | Venda | Atlantic-Congo | |
| vie_Latn | vie | Latn | Vietnamese | Austroasiatic | |
| vmw_Latn | vmw | Latn | Makhuwa | Atlantic-Congo | |
| war_Latn | war | Latn | Waray | Austronesian | |
| wlv_Latn | wlv | Latn | Bermejo Wichí | Mataguayan | |
| wol_Latn | wol | Latn | Wolof | Atlantic-Congo | |
| wuu_Hans | wuu | Hans | Wu Chinese | Sino-Tibetan | |
| xho_Latn | xho | Latn | Xhosa | Atlantic-Congo | |
| xuu_Latn | xuu | Latn | Khwedam | Khoe-Kwadi | |
| ydd_Hebr | ydd | Hebr | Eastern Yiddish | Indo-European | |
| ydg_Arab | ydg | Arab | Yadgha | Indo-European | |
| yor_Latn | yor | Latn | Yoruba | Atlantic-Congo | |
| yua_Latn | yua | Latn | Yucateco | Mayan | |
| yue_Hant | yue | Hant | Yue Chinese | Sino-Tibetan | |
| zai_Latn | zai | Latn | Isthmus Zapotec | Otomanguean | |
| zsm_Latn | zsm | Latn | Colloquial Malay | Austronesian | Code-switching with Standard Malay (zlm_Latn) |
| zne_Latn | zne | Latn | Zande | Atlantic-Congo | |
| zul_Latn | zul | Latn | Zulu | Atlantic-Congo | |
The base dataset has been created manually from scratch by professional linguists, by composing the source sentences that cover a variety of domains and registers in 8 diverse non-English languages: Egyptian Arabic (alternating with Modern Standard Arabic when appropriate), French, German, Hindi, Indonesian, Mandarin Chinese, Russian, and Spanish.
For each of the source languages, the sentences have been created in the following 8 domains:
How-to, written tutorials or instructions
Conversations (dialogues)
Narration (creative writing that doesn’t include dialogues)
Social media posts
Social media comments (reactive)
Other web content
Reflective piece
Miscellaneous (address to a nation, disaster response, etc.)
Apart from the domains, a variety of registers (contextual styles) were used. Each sentence is annotated with the register characterized by three features: connectedness, preparedness, and social differential.
The linguists who were creating the dataset were instructed to maintain the diversity of sentence lengths, word orders, sentence structures, and other linguistic characteristics.
Subsequently, the source sentences were translated from the 8 source languages into English, and then into the other languages.
See the paper for more details.
To contribute to the dataset (adding translations for a new language, or verifying some of the existing translations), please use the web annotation tool at https://bouquet.metademolab.com. Please contact us in case of any questions!
If you are referring to this dataset, please cite the BOUQuET paper and the Omnilingual MT paper.
@inproceedings{andrews-etal-2025-bouquet,
title = "{BOUQ}u{ET} : dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation",
author = "Andrews, Pierre and
Artetxe, Mikel and
Meglioli, Mariano Coria and
Costa-juss{\`a}, Marta R. and
Chuang, Joe and
Dale, David and
Duppenthaler, Mark and
Ekberg, Nathanial Paul and
Gao, Cynthia and
Licht, Daniel Edward and
Maillard, Jean and
Mourachko, Alexandre and
Ropers, Christophe and
Saleem, Safiyyah and
S{\'a}nchez, Eduardo and
Tsiamas, Ioannis and
Turkatenko, Arina and
Ventayol-Boada, Albert and
Yates, Shireen",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1400/",
doi = "10.18653/v1/2025.emnlp-main.1400",
pages = "27515--27535",
ISBN = "979-8-89176-332-6",
}
@misc{omnilingual2026,
title={Omnilingual {MT}: Machine Translation for 1,600 Languages},
author={The Omnilingual MT Team and Belen Alastruey and Niyati Bafna and Andrea Caciolai and Kevin Heffernan and Artyom Kozhevnikov and Christophe Ropers and Eduardo S{\'a}nchez and Charles-Eric Saint-James and Ioannis Tsiamas and Chierh Cheng and Joe Chuang and Paul-Ambroise Duquenne and Mark Duppenthaler and Nate Ekberg and Cynthia Gao and Pere Llu{\'i}s Huguet Cabot and Jo{\~a}o Maria Janeiro and Jean Maillard and Gabriel Mejia Gonzalez and Holger Schwenk and Edan Toledo and Arina Turkatenko and Albert Ventayol-Boada and Rashel Moritz and Alexandre Mourachko and Surya Parimi and Mary Williamson and Shireen Yates and David Dale and Marta R. Costa-juss{\`a}},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.16309},
}
Domain. By the term domain, we mean different spaces in which language is produced in speech, sign, or writing (e.g., books, social media, news, Wikipedia, organization websites, official documents, direct messaging, texting). In this dataset, we focus solely on the written modality.
Register. We understand the term register as a functional variety of language that includes socio-semiotic properties, as expressed in [Halliday and Matthiessen (2004)], or more simply as a "contextual style", as presented in [Labov (1991), pp.79–99]. In that regard, a register is a specific variety of language used to best fit a specific communicative purpose in a specific situation.