Common Voice Scripted Speech 26.0 - French

Français — French (`fr`)

This datasheet describes release cv-corpus-26.0-2026-06-12 of the Mozilla Common Voice Scripted Speech dataset for French (Français, `fr`). The dataset contains 868,857 clips representing 1,216.01 hours of recorded speech (1,102.27 hours validated) from 21,082 speakers, recorded from a text corpus of 1,692,942 sentences.

Language

French is a Romance language. It is an official language of 26 countries and is spoken in around 50 countries.

Variants

Code	Variant	Clips	Speakers
fr-metro	Français de métropole	545,761 (62.8%)	4,652 (22.1%)
fr-europe	Français d'Europe	26,971 (3.1%)	452 (2.1%)
fr-namerica	Français d'Amérique du Nord	14,424 (1.7%)	309 (1.5%)
fr-safrica	Français d'Afrique subsaharienne et des îles africaines	2,071 (0.2%)	88 (0.4%)
fr-droum	Français des départements et régions d'outre-mer	1,936 (0.2%)	40 (0.2%)
fr-nafrica	Français du nord de l'Afrique	1,454 (0.2%)	62 (0.3%)
fr-samerica	Français d'Amérique du Sud et des Caraïbes	90 (0.0%)	6 (0.0%)

Accents

Code	Accent	Clips	Speakers
canada	Français du Canada	12,869 (1.5%)	275 (1.3%)
belgium	Français de Belgique	11,381 (1.3%)	226 (1.1%)
switzerland	Français de Suisse	6,169 (0.7%)	143 (0.7%)
united_states	Français des États-Unis	1,610 (0.2%)	41 (0.2%)
reunion	Français de La Réunion	1,307 (0.2%)	16 (0.1%)
benin	Français du Bénin	1,073 (0.1%)	7 (0.0%)
algeria	Français d’Algérie	1,070 (0.1%)	26 (0.1%)
germany	Français d’Allemagne	552 (0.1%)	26 (0.1%)
fr-metro-north	Français du nord de la France	550 (0.1%)	5 (0.0%)
united_kingdom	Français du Royaume-Uni	502 (0.1%)	25 (0.1%)
haiti	Français d’Haïti	498 (0.1%)	7 (0.0%)
fr-metro-south	Français du sud de la France	319 (0.0%)	12 (0.1%)
madagascar	Français de Madagascar	283 (0.0%)	12 (0.1%)
fr-metro-east	Français de l'est de la France	234 (0.0%)	5 (0.0%)
morocco	Français du Maroc	211 (0.0%)	30 (0.1%)
cote_d_ivoire	Français de Côte d’Ivoire	201 (0.0%)	18 (0.1%)
senegal	Français du Sénégal	197 (0.0%)	16 (0.1%)
french_guiana	Français de Guyane	188 (0.0%)	3 (0.0%)
fr-metro-west	Français de l'ouest de la France	186 (0.0%)	11 (0.1%)
guadeloupe	Français de Guadeloupe	175 (0.0%)	13 (0.1%)
italy	Français d’Italie	171 (0.0%)	9 (0.0%)
cameroon	Français du Cameroun	163 (0.0%)	16 (0.1%)
new_caledonia	Français de Nouvelle-Calédonie	159 (0.0%)	3 (0.0%)
romania	Français de Roumanie	150 (0.0%)	6 (0.0%)
tunisia	Français de Tunisie	121 (0.0%)	16 (0.1%)
monaco	Français de Monaco	111 (0.0%)	3 (0.0%)
netherlands	Français des Pays-Bas	101 (0.0%)	4 (0.0%)
martinique	Français de Martinique	100 (0.0%)	7 (0.0%)
congo_kinshasa	Français du Congo (Kinshasa)	48 (0.0%)	6 (0.0%)
mali	Français du Mali	39 (0.0%)	4 (0.0%)
luxembourg	Français du Luxembourg	20 (0.0%)	3 (0.0%)
st_pierre_et_miquelon	Français de Saint-Pierre-et-Miquelon	15 (0.0%)	1 (0.0%)
mayotte	Français de Mayotte	12 (0.0%)	1 (0.0%)
mauritius	Français de l’Île Maurice	10 (0.0%)	2 (0.0%)
comoros	Français des Comores	5 (0.0%)	1 (0.0%)
-	Other	7,516 (0.9%)	245 (1.2%)

Demographic information

The dataset includes the following self-declared age and gender distributions. A coverage summary is shown below each table.

Gender

Self-declared gender information. The table shows clip and speaker counts with percentages. Speakers who did not declare a gender are listed as Unspecified. A dash (-) indicates zero.

Code	Gender	Clips	Speakers
male_masculine	Male, masculine	494,043 (56.9%)	3,878 (18.4%)
female_feminine	Female, feminine	92,799 (10.7%)	1,013 (4.8%)
transgender	Transgender	5 (0.0%)	1 (0.0%)
non-binary	Non-binary	264 (0.0%)	3 (0.0%)
do_not_wish_to_say	Prefer not to say	302 (0.0%)	4 (0.0%)
-	Unspecified	281,444 (32.4%)	16,863 (80.0%)

Gender declared: 587,413 of 868,857 clips (67.6%), 4,219 of 21,082 speakers (20.0%)
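
The count cells in these tables follow a "count (percentage)" pattern, with a lone dash standing for zero. A minimal Python sketch of parsing such cells and re-deriving the clip coverage figure above is given here; the helper name `parse_cell` and the regular expression are illustrative assumptions, not part of any Common Voice tooling.

```python
import re

# Matches cells such as "494,043 (56.9%)"; a lone "-" means zero.
CELL_RE = re.compile(r"^([\d,]+) \((\d+(?:\.\d+)?)%\)$")


def parse_cell(cell: str) -> tuple[int, float]:
    """Return (count, percentage) for one table cell (hypothetical helper)."""
    cell = cell.strip()
    if cell == "-":
        return 0, 0.0
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    count = int(match.group(1).replace(",", ""))
    return count, float(match.group(2))


# Clip cells copied from the gender table, excluding the Unspecified row.
rows = [
    ("male_masculine", "494,043 (56.9%)"),
    ("female_feminine", "92,799 (10.7%)"),
    ("transgender", "5 (0.0%)"),
    ("non-binary", "264 (0.0%)"),
    ("do_not_wish_to_say", "302 (0.0%)"),
]

declared_clips = sum(parse_cell(cell)[0] for _, cell in rows)
total_clips = 868_857  # total clips in this release
coverage = 100 * declared_clips / total_clips
print(f"{declared_clips:,} of {total_clips:,} clips ({coverage:.1f}%)")
```

Summing the declared rows reproduces the clip count in the coverage line. The same approach does not carry over to the speaker column, where the declared rows sum to more than the 4,219 speakers reported in the coverage line.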

Age

Self-declared age information. The table shows clip and speaker counts with percentages. Speakers who did not declare an age are listed as Unspecified. A dash (-) indicates zero.

Code	Age	Clips	Speakers
teens	Teens	24,620 (2.8%)	445 (2.1%)
twenties	Twenties	147,945 (17.0%)	1,733 (8.2%)
thirties	Thirties	125,487 (14.4%)	1,156 (5.5%)
fourties	Forties	123,959 (14.3%)	857 (4.1%)
fifties	Fifties	83,359 (9.6%)	498 (2.4%)
sixties	Sixties	29,118 (3.4%)	328 (1.6%)
seventies	Seventies	9,375 (1.1%)	121 (0.6%)
eighties	Eighties	212 (0.0%)	7 (0.0%)
nineties	Nineties	5 (0.0%)	1 (0.0%)
-	Unspecified	324,777 (37.4%)	16,694 (79.2%)

Age declared: 544,080 of 868,857 clips (62.6%), 4,388 of 21,082 speakers (20.8%)

Data splits for modelling

Clip buckets

Bucket	Clips
Validated	787,590 (90.6%)
Invalidated	68,479 (7.9%)
Other	12,788 (1.5%)

Training splits

Split	Clips
Train	617,587 (78.4%)
Dev	16,204 (2.1%)
Test	16,204 (2.1%)

Training split coverage: 649,995 of 787,590 validated clips (82.5%)

The dataset contains 787,590 validated, 68,479 invalidated, and 12,788 unresolved clips (the Other bucket). The average clip duration is 5.038 seconds.
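
The split coverage and average duration can be re-derived from the figures in this datasheet. The short Python sketch below does so; it assumes the average duration is computed over all clips and all recorded hours, which is an interpretation that happens to match the published value rather than a documented formula.

```python
# Figures copied from this datasheet.
total_clips = 868_857
total_hours = 1216.01
validated_clips = 787_590
splits = {"train": 617_587, "dev": 16_204, "test": 16_204}

# Average clip duration in seconds, assuming all clips and all hours are used.
avg_seconds = total_hours * 3600 / total_clips

# Share of validated clips that were assigned to a training split.
split_total = sum(splits.values())
coverage = 100 * split_total / validated_clips
unassigned = validated_clips - split_total

print(f"average clip duration: {avg_seconds:.3f} s")
print(f"split coverage: {split_total:,} of {validated_clips:,} ({coverage:.1f}%)")
print(f"validated clips outside train/dev/test: {unassigned:,}")
```

The last figure shows that a sizeable share of validated clips is not assigned to train, dev, or test in this release.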

Text corpus

Validated sentences: 1,649,352

Category	Count
Unvalidated sentences	43,590
Pending sentences	43,455
Rejected sentences	135
Reported sentences	7,574

The corpus contains 1,692,942 sentences: 1,649,352 validated and 43,590 unvalidated (43,455 pending review, 135 rejected), with 7,574 reported for review.

Writing system

The French language uses the 26 letters of the Latin alphabet with the addition of two ligatures (æ, œ) and five diacritics.

Symbol table

a à â æ b c ç d e é è ê ë f g h i î ï j k l m n o ô œ p q r s t u ù û ü v w x y ÿ z

Sample

There follows a randomly selected sample of five sentences from the corpus.

Après la mort du pontife, il revint dans le royaume de Naples.
Le township est baptisé aux vergers de pommiers situés dans ses limites.
La gélatine est l'émulsion qui contient les pigments.
Durant cette semaine, diverses compétitions inter-universités et remises de récompenses ont également lieu.
Son père était directeur d'école et sa mère enseignait le dessin.

Sources

Source	Sentences
wiki-2	719,731 (43.8%)
wiki-1	717,145 (43.7%)
sentence-collector	103,289 (6.3%)
issue2259_deleted_export_readd_fixed	62,385 (3.8%)
Other	39,330 (2.4%)

Text domains

Code	Domain	Clips	Speakers
general	General	71 (0.0%)	54 (0.3%)
agriculture_food	Agriculture and Food	-	-
automotive_transport	Automotive and Transport	1 (0.0%)	1 (0.0%)
finance	Finance	1 (0.0%)	1 (0.0%)
service_retail	Service and Retail	-	-
healthcare	Healthcare	5 (0.0%)	4 (0.0%)
history_law_government	History, Law and Government	19 (0.0%)	17 (0.1%)
media_entertainment	Media and Entertainment	18 (0.0%)	14 (0.1%)
nature_environment	Nature and Environment	8 (0.0%)	8 (0.0%)
news_current_affairs	News and Current Affairs	2 (0.0%)	2 (0.0%)
technology_robotics	Technology and Robotics	18 (0.0%)	12 (0.1%)
language_fundamentals	Language Fundamentals	7 (0.0%)	5 (0.0%)

Fields

Clips

Each row of a clips TSV file represents a single audio clip and contains the following fields:

client_id - hashed UUID of a given user
path - relative path of the audio file
sentence - the sentence to be read aloud
sentence_id - unique identifier for the sentence
sentence_domain - domain classification(s) of the sentence
up_votes - number of reviewers who said the audio matches the text
down_votes - number of reviewers who said the audio does not match the text
age - age of the speaker [1]
gender - gender of the speaker [1]
accents - accents of the speaker [1]
variant - variant of the language [1]
locale - locale code of the language
segment - if sentence belongs to a custom dataset segment, it will be listed here
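
As a rough illustration of the fields listed above, the following Python sketch reads a clips TSV with the standard `csv` module. The sample rows, the file names in `path`, and the `read_clips` helper are made up for this example. Treating empty demographic cells as missing is also an assumption, as is the vote comparison at the end, which is a simple illustrative filter and not the project's validation rule.

```python
import csv
import io

FIELDS = [
    "client_id", "path", "sentence", "sentence_id", "sentence_domain",
    "up_votes", "down_votes", "age", "gender", "accents", "variant",
    "locale", "segment",
]

# Two made-up rows standing in for a real clips TSV file.
SAMPLE_ROWS = [
    ["abc123", "clip_0001.mp3",
     "Son père était directeur d'école et sa mère enseignait le dessin.",
     "s1", "", "2", "0", "thirties", "male_masculine", "", "fr-metro", "fr", ""],
    ["def456", "clip_0002.mp3",
     "La gélatine est l'émulsion qui contient les pigments.",
     "s2", "general", "0", "2", "", "", "", "", "fr", ""],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [FIELDS, *SAMPLE_ROWS]) + "\n"


def read_clips(handle):
    """Parse clip rows, converting votes to int and empty demographics to None."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips = []
    for row in reader:
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        for key in ("age", "gender", "accents", "variant"):
            row[key] = row[key] or None  # speaker did not opt in
        clips.append(row)
    return clips


clips = read_clips(io.StringIO(SAMPLE_TSV))
favoured = [c["path"] for c in clips if c["up_votes"] > c["down_votes"]]
print(favoured)
```

For a real file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"` and `newline=""`.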

`validated_sentences.tsv`

The validated_sentences.tsv file contains one row per validated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
is_used - whether the sentence is still in circulation for recording
clips_count - number of clips recorded for this sentence
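
A per-source summary like the Sources table can be approximated from this file. The Python sketch below tallies sentences and recorded clips per `source` value. The sample rows and the `summarise_sources` helper are hypothetical, and the `is_used` values shown are placeholders because their exact encoding is not stated here.

```python
import csv
import io
from collections import Counter

# Made-up rows using the columns listed above; is_used values are placeholders.
SAMPLE_TSV = (
    "sentence_id\tsentence\tvariant\tsentence_domain\tsource\tis_used\tclips_count\n"
    "s1\tPremière phrase.\t\t\twiki-1\t1\t3\n"
    "s2\tDeuxième phrase.\t\tgeneral\twiki-2\t1\t0\n"
    "s3\tTroisième phrase.\tfr-metro\t\twiki-1\t0\t1\n"
)


def summarise_sources(handle):
    """Return (sentences per source, clips per source) as two Counters."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = Counter()
    clips = Counter()
    for row in reader:
        sentences[row["source"]] += 1
        clips[row["source"]] += int(row["clips_count"] or 0)
    return sentences, clips


sentences, clips = summarise_sources(io.StringIO(SAMPLE_TSV))
for source, count in sentences.most_common():
    print(source, count, clips[source])
```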

`unvalidated_sentences.tsv`

The unvalidated_sentences.tsv file contains one row per unvalidated sentence in the text corpus:

sentence_id - unique identifier for the sentence
sentence - the sentence text
variant - the variant of the language
sentence_domain - the domain(s) the sentence belongs to
source - the source the sentence was collected from
up_votes - number of upvotes the sentence received
down_votes - number of downvotes the sentence received
status - current status of the sentence (pending or rejected)

Get involved

Community links

Discussions

Contribute

Licence

This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree to not determine the identity of speakers in the dataset.

Footnotes

[1] For a full list of age, gender, and accent options, see the demographics spec. These will only be reported if the speaker opted in to provide that information.
