LezizNet - Turkish food images | Mozilla Data Collective

Description

LezizNet is an openly licensed image dataset of Turkish cuisine, sourced from Openverse (which aggregates Flickr, Wikimedia Commons, iNaturalist and others) and from Wikimedia Commons directly. It contains 3,272 images covering 245 distinct dish labels; each row carries the original source URL, creator, and exact Creative Commons license for reproducible attribution. Some images are multi-label (a plate of "kuru fasulye, pilav" is tagged with both dishes), reflecting how Turkish meals are actually served. The dataset was built semi-automatically. Images were collected by querying ~120 Turkish food terms and then deduplicated; non-Turkish and non-food images were filtered out through a two-stage manual and CLIP-based review that rejected ~40% of the scraped images. Each label was derived from the image's title/tags, the search query, visual inspection, or assigned by hand, and the labels were then manually reviewed and cleaned. Because labels are partly automatic, labeling errors remain: the `food_name_source` and `label_confidence` columns flag how each label was derived and how reliable it is. Corrections and contributions are welcome (contact below).

Specifics

Licensing

Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)

Contents. 3,272 images (images/) and a metadata.tsv with one row per image.

Statistics.

Images: 3,272 | Distinct dish labels: 245 | Multi-label images: 380 (12%)
Sources: Wikimedia Commons 2,232, Flickr 1,038, iNaturalist 2 (via Openverse 3,181 + direct Wikimedia 91)
Licenses: CC-BY-SA 2,569 (79%), CC-BY 692 (21%), CC0/Public Domain 11
Label provenance (food_name_source): title/tags 2,326 (71%), search query 758 (23%), vision/AI-assisted 181 (6%), manual 7
Most frequent dishes: kebap (270), baklava (242), dolma (183), köfte (170), pilav (158), döner (155), simit (130), ayran (109), sarma (90), börek (84)
~1.4 MB average per image
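
The label statistics above could be recomputed from the `food_name` column along the following lines. This is a minimal sketch rather than the author's script: it assumes a multi-label image counts once for each of its dishes, which may not match how the published per-dish counts were tallied, and the sample values are invented for illustration.

```python
from collections import Counter


def labels_of(food_name):
    """Split a comma-separated food_name value into clean dish labels."""
    return [part.strip() for part in food_name.split(",") if part.strip()]


def dish_counts(food_names):
    """Count dish labels; a multi-label image contributes once per dish (assumption)."""
    counts = Counter()
    for value in food_names:
        counts.update(labels_of(value))
    return counts


def multi_label_share(food_names):
    """Return (number of images with more than one label, share of all images)."""
    multi = sum(1 for value in food_names if len(labels_of(value)) > 1)
    share = multi / len(food_names) if food_names else 0.0
    return multi, share


# Invented values standing in for the food_name column of metadata.tsv.
names = ["kebap", "kuru fasulye, pilav", "pilav", "baklava"]
print(dish_counts(names).most_common(1))
print(multi_label_share(names))
```

Running the same functions over the real column gives a quick consistency check against the figures listed here.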

Fields (full descriptions in README.md): filepath, filename, source (hosting platform), origin_scrape (openverse / wikimedia_commons), source_id, title, food_name (dish label(s), comma-separated for multi-dish plates), label_confidence (high/low/empty), food_name_source (how the label was derived), query_used, creator, source_url, license, license_url, flickr_tags, clarifai_tags, description, food_score (CLIP food-likelihood). Every image is fully attributable via creator + source_url + license.
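
As an illustration of working with these fields, the sketch below reads tab-separated rows, splits a multi-dish `food_name`, and formats a credit line from `title`, `creator`, `license`, and `source_url`. It assumes the file is UTF-8 with a header row and no quote-wrapped fields, and the credit format is only one reasonable choice; the helper names and sample row are hypothetical, and README.md remains the reference for the actual column contents.

```python
import csv
import io


def read_metadata(fileobj):
    """Read tab-separated metadata rows into a list of dicts keyed by column name."""
    # QUOTE_NONE is an assumption: quotes inside titles are treated as literal text.
    return list(csv.DictReader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE))


def split_labels(food_name):
    """Split a comma-separated food_name value into a list of dish labels."""
    return [part.strip() for part in food_name.split(",") if part.strip()]


def attribution(row):
    """Format a one-line credit from title, creator, license and source_url."""
    return f'"{row["title"]}" by {row["creator"]}, {row["license"]}, {row["source_url"]}'


# In-memory stand-in for metadata.tsv; every value here is invented.
sample = io.StringIO(
    "filename\ttitle\tfood_name\tcreator\tsource_url\tlicense\n"
    "a.jpg\tLunch plate\tkuru fasulye, pilav\tExample Creator\t"
    "https://example.org/a\tCC-BY-SA 2.0\n"
)
rows = read_metadata(sample)
print(split_labels(rows[0]["food_name"]))
print(attribution(rows[0]))
```

For the real file, the same reader could be called on `open("metadata.tsv", newline="", encoding="utf-8")`, assuming that path and encoding.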

Why this dataset. Food-recognition research has a documented double concentration: the large-scale datasets come from a handful of groups, and coverage is dominated by Western and East Asian cuisines, while Turkish and Middle Eastern cuisines are sparse or absent (e.g. the community dataset World Wide Dishes contains no Turkish dish). Existing Turkish food datasets (TurkishFoods-15/-25, Turkish Food-102) are web-crawled from image search with unstated rights and cannot be cleanly redistributed or used commercially. This dataset fills the underpopulated quadrant that combines regional coverage with clean licensing: to our knowledge it is the first openly licensed Turkish food image dataset with reproducible, machine-readable per-image provenance. Food-specific pretraining has been shown to transfer substantially better than ImageNet baselines (Romero-Tapiador et al., 2024), so a clean regional corpus is useful as (a) in-domain fine-tuning data, (b) a held-out evaluation set exposing the geographic blind spots of food models and vision-language models, and (c) additional pretraining signal for an underrepresented cuisine. The multi-label design (real plates mix dishes) also makes it a testbed for intra-class and mixed-plate problems that single-label benchmarks hide.

Sources. Openverse (https://openverse.org) and Wikimedia Commons (https://commons.wikimedia.org). Methodology, sources, and full caveats are documented in README.md.

Construction, limitations & contributing. LezizNet is a semi-automatically created dataset: images were scraped and labels auto-derived, then manually reviewed and cleaned. It is not error-free: labeling errors remain, especially in auto-derived labels (food_name_source = search query or vision). The image set is also intentionally diverse. It includes Turkish food in real-world contexts (packaged products, people dining, market/serving scenes), not only isolated plated dishes, since modern vision-language models benefit from contextual imagery. This is a living dataset: corrections, additional images, and collaboration to improve it are welcome. Contact Alp Öktem (alp@oktem.me, https://alp.oktem.me/).
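
Given the caveat about auto-derived labels, a user who wants a cleaner subset might filter on `food_name_source` and `label_confidence`. The sketch below shows one way to do that. The `label_confidence` values (high/low/empty) come from the field list, but the exact strings stored in `food_name_source` are not stated on this page, so the ones used here (`title_tags`, `manual`, `search_query`) are placeholders to be replaced with the values documented in README.md.

```python
def reliable_rows(rows, trusted_sources, require_high_confidence=True):
    """Keep rows whose label derivation is trusted and, optionally, marked high confidence."""
    kept = []
    for row in rows:
        if row.get("food_name_source", "") not in trusted_sources:
            continue  # label came from a derivation the caller does not trust
        if require_high_confidence and row.get("label_confidence", "") != "high":
            continue  # low or empty confidence
        kept.append(row)
    return kept


# Invented rows; the food_name_source strings are hypothetical placeholders.
rows = [
    {"filename": "a.jpg", "food_name_source": "title_tags", "label_confidence": "high"},
    {"filename": "b.jpg", "food_name_source": "search_query", "label_confidence": "high"},
    {"filename": "c.jpg", "food_name_source": "manual", "label_confidence": ""},
]
subset = reliable_rows(rows, trusted_sources={"title_tags", "manual"})
print([row["filename"] for row in subset])
```

Passing `require_high_confidence=False` keeps every row from a trusted derivation, including rows with low or empty confidence.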

Please check README.md for more information.
