LibriVox Croatian TTS Male Voice
License:
CC0-1.0
Steward:
MDC Curators
Task: TTS
Release Date: 4/14/2026
Format: MP3, TXT, TSV
Size: 377.60 MB
Description
4 hours of sentence-aligned speech and text from "Priče iz Davnine" by Ivana Brlić-Mažuranić (1874–1938), read for LibriVox, containing over 2,000 sentences and 31,000 words.
Specifics
Considerations
Restrictions/Special Constraints
NA
Forbidden Usage
You agree not to attempt to determine the identity of speakers in this dataset.
Processes
Intended Use
Training neural TTS acoustic models (e.g., FastSpeech, VITS, or similar architectures)
Fine-tuning pre-trained multilingual TTS models for Croatian
Benchmarking Croatian speech synthesis quality
Linguistic research on Croatian prosody and phonetics
Metadata
Datasheet: Priče iz Davnine — Croatian TTS Dataset
Dataset Overview
| Field | Value |
| --- | --- |
| Language | Croatian (hr) |
| Source Text | Priče iz Davnine by Ivana Brlić-Mažuranić |
| Source Audio | LibriVox public domain recording (https://librivox.org/price-iz-davnine-by-ivana-brlic-mazuranic/) |
| Alignment | Sentence-level |
| Sentences | 2,032 |
| Words | 31,346 |
| License | CC0-1.0 |
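As a quick sanity check on these figures, the averages they imply can be worked out directly. This is a rough sketch only: the 4-hour duration is the rounded figure from the description, not a measured total, so the per-sentence and per-minute values are approximate.

```python
# Figures taken from the overview table and the description.
sentences = 2032
words = 31346
hours = 4.0  # rounded duration from the description; an approximation

words_per_sentence = words / sentences
seconds_per_sentence = hours * 3600 / sentences
words_per_minute = words / (hours * 60)

print(round(words_per_sentence, 1))    # 15.4
print(round(seconds_per_sentence, 1))  # 7.1
print(round(words_per_minute, 1))      # 130.6
```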
The Language
Croatian (hrvatski jezik) is a South Slavic language of the Indo-European family, written in the Latin script. It is the official language of Croatia and one of the official languages of Bosnia and Herzegovina, and is spoken by approximately 5–6 million people worldwide.
The Source Text
Title: Priče iz Davnine (English: Croatian Tales of Long Ago)
Author: Ivana Brlić-Mažuranić (18 April 1874 – 21 September 1938)
First Published: 1916
Genre: Fairy tales / Literary fiction
Priče iz Davnine is the most celebrated work of Ivana Brlić-Mažuranić, widely regarded as the greatest Croatian author of children's literature and sometimes called the "Croatian Hans Christian Andersen." The collection consists of nine fairy tales rooted in Slavic mythology and folklore, drawing on characters and motifs from pre-Christian South Slavic religion. The stories feature figures such as Stribog (the god of winds), the Domovoi (household spirits), and other mythological beings, woven into original narratives of moral depth and lyrical beauty. Priče iz Davnine reflects the literary Croatian of the era in which it was written. The language is largely intelligible to contemporary speakers, but carries a formal, archaic register characteristic of Croatian fairy-tale and epic literary tradition.
The text is in the public domain worldwide, as the author died in 1938 and more than 70 years have elapsed since her death.
Stories from the collection included in this dataset:
Kako je Potjeh tražio istinu (01)
Ribar Palunko i njegova žena (02)
Regoč (03)
Sunce djever i Neva Nevičica (04)
Šuma Striborova (05)
Bratac Jaglenac i sestrica Rutvica, prvi dio (06)
Bratac Jaglenac i sestrica Rutvica, drugi dio (07)
The Source Audio
LibriVox is a volunteer-driven project founded in 2005 with the goal of recording all books in the public domain and making them freely available as audiobooks. All recordings are released into the public domain under the LibriVox license, meaning they may be freely used, distributed, and adapted for any purpose, including the creation of speech datasets.
Dataset Construction
Alignment Method
The dataset was constructed by sentence-aligning the source text with the LibriVox audio recording of Priče iz Davnine. In this case, "sentence" is a best approximation using sentence-final punctuation. In order to get sentence-level alignments, we used the Montreal Forced Aligner to produce word-level alignments and rolled these up to the sentence level.
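The roll-up from word-level to sentence-level alignments can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the aligner output has already been parsed into ordered `(token, start, end)` tuples and that each sentence is available as a list of the same tokens. The function name `roll_up` and the toy words and timings are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (token, start_sec, end_sec)
Span = Tuple[str, float, float]  # (sentence_text, start_sec, end_sec)


def roll_up(words: List[Word], sentences: List[List[str]]) -> List[Span]:
    """Collapse ordered word intervals into one time span per sentence."""
    spans: List[Span] = []
    i = 0
    for tokens in sentences:
        if not tokens:
            continue
        chunk = words[i:i + len(tokens)]
        # Guard against drift between the transcript and the alignment.
        if [w[0] for w in chunk] != tokens:
            raise ValueError(f"alignment/text mismatch near word index {i}")
        # Sentence start = first word's start; sentence end = last word's end.
        spans.append((" ".join(tokens), chunk[0][1], chunk[-1][2]))
        i += len(tokens)
    return spans


# Toy data (made up for illustration, not taken from the dataset).
words = [
    ("bio", 0.00, 0.25), ("jednom", 0.25, 0.70), ("starac", 0.70, 1.20),
    ("imao", 1.55, 1.90), ("je", 1.90, 2.00), ("tri", 2.00, 2.30),
    ("unuka", 2.30, 2.85),
]
sentences = [["bio", "jednom", "starac"], ["imao", "je", "tri", "unuka"]]
spans = roll_up(words, sentences)
for text, start, end in spans:
    print(f"{start:.2f}\t{end:.2f}\t{text}")
```

Taking the first word's start and the last word's end keeps any pause between sentences out of both clips, which is usually what a TTS training set wants.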
Preprocessing
The original audio contains an introduction about LibriVox that is not represented in the text. It was removed from each chapter.
The original MP3s used a variable bitrate. To ensure compatibility and make data validation easier, we converted them to a constant bitrate (128 kb/s).
Parentheses and brackets were removed.
Newlines were replaced with single spaces, and sentences were split on sentence-final punctuation (.!?).
Dashes and quotation marks were removed.
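The text-side steps above can be sketched roughly as below. This is an illustrative approximation, not the script used to build the dataset. The order of operations, the exact set of characters treated as dashes and quotation marks, and the choice to keep the words that were inside parentheses are all assumptions, and `preprocess` is a hypothetical name.

```python
import re
from typing import List


def preprocess(text: str) -> List[str]:
    """Normalize raw text and split it into sentences."""
    # Drop parenthesis and bracket characters (enclosed words are kept; an assumption).
    text = re.sub(r"[()\[\]]", "", text)
    # Replace newlines (and surrounding whitespace) with single spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Split after sentence-final punctuation that is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for part in parts:
        # Remove quotation marks and dashes (character set is an assumption),
        # then collapse the whitespace left behind.
        part = re.sub(r'["\u201c\u201d\u201e\u00ab\u00bb\u2013\u2014-]', " ", part)
        part = re.sub(r"\s+", " ", part).strip()
        if part:
            sentences.append(part)
    return sentences


# Made-up sample text for illustration, not taken from the dataset.
sample = "Bio jednom jedan starac (djed)\ni imao tri unuka. „Kamo ćeš?” – upita on. Idi!"
sentences = preprocess(sample)
for s in sentences:
    print(s)
```

Because the split requires whitespace right after the punctuation mark, a question mark followed directly by a closing quote does not end a sentence here. A real pipeline may well handle that case differently.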
Intended Uses
This dataset is intended for use in training, fine-tuning, and evaluating text-to-speech (TTS) systems for Croatian. Potential applications include:
Training neural TTS acoustic models (e.g., FastSpeech, VITS, or similar architectures)
Fine-tuning pre-trained multilingual TTS models for Croatian
Benchmarking Croatian speech synthesis quality
Linguistic research on Croatian prosody and phonetics
Out-of-Scope Uses
Because the text is drawn from a single literary work in a formal and archaic register, this dataset is not representative of contemporary spoken Croatian or conversational speech. Models trained solely on this dataset may not generalize well to modern, colloquial, or domain-specific speech styles.
Limitations and Biases
Single speaker: The dataset contains audio from a single LibriVox volunteer reader. Speaker diversity is therefore absent.
Register: The language is formal and literary, with some archaic vocabulary, which may limit the naturalness of TTS output in everyday contexts.
Domain: The dataset covers a single book from a specific genre (fairy tales / mythology), limiting topical diversity.
Further Reading
Brlić-Mažuranić, I. (1916). Priče iz Davnine. Zagreb: St. Kugli.
LibriVox: https://librivox.org
Mozilla Common Voice & Data Collective: https://commonvoice.mozilla.org