LibriVox Croatian TTS Male Voice

Datasheet: Priče iz Davnine — Croatian TTS Dataset

Dataset Overview


Language	Croatian (`hr`)
Source Text	Priče iz Davnine by Ivana Brlić-Mažuranić
Source Audio	LibriVox public domain recording (https://librivox.org/price-iz-davnine-by-ivana-brlic-mazuranic/)
Alignment	Sentence-level
Sentences	2,032
Words	31,346
License	CC-0

The Language

Croatian (hrvatski jezik) is a South Slavic language of the Indo-European family, written in the Latin script. It is the official language of Croatia and one of the official languages of Bosnia and Herzegovina, and is spoken by approximately 5–6 million people worldwide.

The Source Text

Title: Priče iz Davnine (English: Croatian Tales of Long Ago)

Author: Ivana Brlić-Mažuranić (18 April 1874 – 21 September 1938)

First Published: 1916

Genre: Fairy tales / Literary fiction

Priče iz Davnine is the most celebrated work of Ivana Brlić-Mažuranić, widely regarded as the greatest Croatian author of children's literature and sometimes called the "Croatian Hans Christian Andersen." The collection, first published with six fairy tales and expanded to eight in a later edition, is rooted in Slavic mythology and folklore and draws on characters and motifs from pre-Christian South Slavic religion. The stories feature figures such as Stribog (the god of winds), the Domovoi (household spirits), and other mythological beings, woven into original narratives of moral depth and lyrical beauty.

Priče iz Davnine reflects the literary Croatian of the era in which it was written. The language is largely intelligible to contemporary speakers, but carries the formal, archaic register characteristic of the Croatian fairy-tale and epic literary tradition.

The text is in the public domain in jurisdictions where copyright lasts for the author's life plus 70 years or less: the author died in 1938, so that term has expired.

Stories included in the collection:

Kako je Potjeh tražio istinu (01)
Ribar Palunko i njegova žena (02)
Regoč (03)
Sunce djever i Neva Nevičica (04)
Šuma Striborova (05)
Bratac Jaglenac i sestrica Rutvica, prvi dio (06)
Bratac Jaglenac i sestrica Rutvica, drugi dio (07)

The Source Audio

LibriVox is a volunteer-driven project founded in 2005 with the goal of recording all books in the public domain and making them freely available as audiobooks. LibriVox recordings are themselves released into the public domain, meaning they may be freely used, distributed, and adapted for any purpose, including the creation of speech datasets.

Dataset Construction

Alignment Method

The dataset was constructed by sentence-aligning the source text with the LibriVox audio recording of Priče iz Davnine. Here, a "sentence" is an approximation: a span of text that ends in sentence-final punctuation (., ! or ?). To obtain sentence-level alignments, we used the Montreal Forced Aligner to produce word-level alignments and then merged these into sentence-level spans.
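The roll-up from word-level to sentence-level alignments can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset. It assumes each aligned word has already been paired with its transcript token, so that sentence-final punctuation is still attached to the token. How the original pipeline recovered punctuation from the aligner output is not described here. The names `WordInterval` and `roll_up_to_sentences` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: sentence boundaries are marked by these characters only.
SENTENCE_FINAL = (".", "!", "?")


@dataclass
class WordInterval:
    """One aligned word: transcript token plus start/end time in seconds."""
    text: str
    start: float
    end: float


def _close(group: List[WordInterval]) -> Tuple[str, float, float]:
    """Turn a run of words into (sentence text, start time, end time)."""
    text = " ".join(word.text for word in group)
    return (text, group[0].start, group[-1].end)


def roll_up_to_sentences(words: List[WordInterval]) -> List[Tuple[str, float, float]]:
    """Merge consecutive word intervals into sentence intervals.

    A sentence ends at the first word whose token ends in sentence-final
    punctuation. Trailing words without such punctuation form a last sentence.
    """
    sentences = []
    current: List[WordInterval] = []
    for word in words:
        current.append(word)
        if word.text.endswith(SENTENCE_FINAL):
            sentences.append(_close(current))
            current = []
    if current:
        sentences.append(_close(current))
    return sentences


# Illustrative input only; the words and timings are made up.
example = [
    WordInterval("Bio", 0.0, 0.3),
    WordInterval("jednom", 0.3, 0.8),
    WordInterval("djed.", 0.8, 1.2),
    WordInterval("Imao", 1.5, 1.9),
    WordInterval("je", 1.9, 2.0),
    WordInterval("sina!", 2.0, 2.6),
]
for sentence in roll_up_to_sentences(example):
    print(sentence)
```

Each sentence takes the start time of its first word and the end time of its last word, so silence between sentences is left out of the resulting spans.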

Preprocessing

The original audio contains a spoken introduction about LibriVox that is not represented in the text. This was removed from each chapter.
The original MP3 files used a variable bitrate. To ensure compatibility and make data validation easier, we converted them to a constant bitrate (128 kbit/s).
Parentheses and brackets were removed.
Newlines were replaced with single spaces, and sentences were split on sentence-final punctuation (. ! ?).
Dashes and quotation marks were removed.
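The text-side steps above might look roughly like the sketch below. It is not the original preprocessing script, and it rests on several assumptions: only the bracket characters are dropped (their contents are kept), only en and em dashes count as "dashes" (hyphens inside words are left alone), quotation marks are stripped before splitting rather than after, and a sentence boundary is a `.`, `!` or `?` followed by whitespace. The function names are hypothetical.

```python
import re

# Assumption: remove the bracket characters themselves, keep what they enclose.
BRACKETS = re.compile(r"[()\[\]{}]")
# Assumption: en/em dashes plus straight, curly, low and angle quotation marks.
DASHES_AND_QUOTES = re.compile("[\u2013\u2014\"\u201c\u201d\u201e\u00ab\u00bb]")
WHITESPACE = re.compile(r"\s+")
# Split after sentence-final punctuation that is followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def normalize(text: str) -> str:
    """Strip brackets, dashes and quotes, and flatten newlines to single spaces."""
    text = BRACKETS.sub("", text)
    text = DASHES_AND_QUOTES.sub(" ", text)
    # Collapsing all whitespace also turns newlines into single spaces.
    return WHITESPACE.sub(" ", text).strip()


def split_sentences(text: str) -> list:
    """Split normalized text into sentences, keeping the final punctuation."""
    return [s for s in SENTENCE_BOUNDARY.split(normalize(text)) if s]


# Illustrative input only; this is not a quotation from the source text.
raw = "Bio je \u201estar\u201c djed\n(vrlo star). \u0160to sad? Ni\u0161ta \u2013 re\u010de on!"
for sentence in split_sentences(raw):
    print(sentence)
```

Splitting on punctuation alone will also break at abbreviations and ellipses, which is one reason the resulting "sentences" are only an approximation.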

Intended Uses

This dataset is intended for use in training, fine-tuning, and evaluating text-to-speech (TTS) systems for Croatian. Potential applications include:

Training neural TTS acoustic models (e.g., FastSpeech, VITS, or similar architectures)
Fine-tuning pre-trained multilingual TTS models for Croatian
Benchmarking Croatian speech synthesis quality
Linguistic research on Croatian prosody and phonetics

Out-of-Scope Uses

Because the text is drawn from a single literary work in a formal and archaic register, this dataset is not representative of contemporary spoken Croatian or conversational speech. Models trained solely on this dataset may not generalize well to modern, colloquial, or domain-specific speech styles.

Limitations and Biases

Single speaker: The dataset contains audio from a single LibriVox volunteer reader. Speaker diversity is therefore absent.
Register: The language is formal and literary, with some archaic vocabulary, which may limit the naturalness of TTS output in everyday contexts.
Domain: The dataset covers a single book from a specific genre (fairy tales / mythology), limiting topical diversity.
