The LJ Speech Dataset

Description

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

N/A

Forbidden Usage

N/A

Processes

Intended Use

Text-to-speech (TTS) synthesis model training and evaluation; Neural speech synthesis benchmarking; Voice modeling, prosody, and pronunciation research; Educational and prototyping use cases for speech generation

Datasheet: The LJ Speech Dataset

Overview

Field	Details
Name	The LJ Speech Dataset
Version	1.1 (current)
Released	2017
Homepage	https://keithito.com/LJ-Speech-Dataset/
License	Public Domain
Download Size	2.6 GB

Dataset Description

The LJ Speech Dataset is a public domain speech dataset consisting of 13,100 short audio clips of a single female speaker reading passages from 7 non-fiction books. A text transcription is provided for each clip. Clips range from approximately 1 to 10 seconds in length, with a total duration of roughly 24 hours.

The dataset is widely used for training and benchmarking text-to-speech (TTS) synthesis systems. Its high audio quality, clean segmentation, and aligned transcripts make it a standard benchmark in neural speech synthesis research.

Language

Field	Details
Language	English (en)
Script	Latin
Dialect	American English

All transcriptions and audio recordings are in English.

Creation

Audio Recordings

The audio was recorded in 2016–2017 by Linda Johnson as part of the LibriVox project. LibriVox is a volunteer-run initiative that produces free public domain audiobooks. The original LibriVox recordings were distributed as 128 kbps MP3 files, and as a result some clips may contain minor artifacts introduced by MP3 encoding.

Audio clips were segmented automatically based on silences in the recording. Clip boundaries generally align with sentence or clause boundaries, though not always.

Transcription & Annotation

Alignment and annotation were performed manually by Keith Ito. A quality assurance pass was done to ensure that text transcriptions accurately matched the spoken audio.

Source Texts

The dataset consists of excerpts from the following 7 public domain works (published 1884–1964):

Title	Author	Year
The Chronicles of Newgate, Vol. 2	Arthur Griffiths	1884
Arts and Crafts Essays	William Morris et al.	1893
Marion Harland's Cookery for Beginners	Marion Harland	1893
The Science-History of the Universe, Vol. 5: Biology	Francis Rolt-Wheeler	1910
The Seven Wonders of the Ancient World	Edgar J. Banks	1916
The Fireside Chats of Franklin Delano Roosevelt	Franklin D. Roosevelt	1933–42
Report of the President's Commission on the Assassination of President Kennedy	President's Commission	1964

Dataset Statistics

Metric	Value
Total Clips	13,100
Total Words	225,715
Total Characters	1,308,678
Total Duration	23 hr 55 min 17 sec
Mean Clip Duration	6.57 sec
Min Clip Duration	1.11 sec
Max Clip Duration	10.10 sec
Mean Words per Clip	17.23
Distinct Words	13,821
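
The two per-clip means in the table can be re-derived from the totals, which makes a quick consistency check. The short Python sketch below uses only figures copied from the table; the variable names are chosen for illustration.

```python
# Figures copied from the statistics table.
TOTAL_CLIPS = 13_100
TOTAL_WORDS = 225_715
TOTAL_DURATION_S = 23 * 3600 + 55 * 60 + 17  # 23 hr 55 min 17 sec

mean_clip_duration = TOTAL_DURATION_S / TOTAL_CLIPS
mean_words_per_clip = TOTAL_WORDS / TOTAL_CLIPS

print(f"total duration:      {TOTAL_DURATION_S} s")         # 86117 s
print(f"mean clip duration:  {mean_clip_duration:.2f} s")   # 6.57 s
print(f"mean words per clip: {mean_words_per_clip:.2f}")    # 17.23
```

Both results agree with the reported means of 6.57 sec and 17.23 words.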

File Format

Audio: Single-channel 16-bit PCM WAV files at a sample rate of 22,050 Hz
Metadata: metadata.csv (pipe-delimited, |), with three fields per record:
1. ID — name of the corresponding .wav file
2. Transcription — words spoken by the reader (UTF-8)
3. Normalized Transcription — transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8)
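
As a minimal sketch of how the format described above might be read, the Python below parses the pipe-delimited metadata file and checks one clip against the documented audio format, using only the standard library. The names Record, read_metadata, and wav_duration_seconds are hypothetical helpers, not part of the dataset. The sketch also assumes that the file has no header row and that quotation marks in transcripts are literal text rather than CSV quoting.

```python
import csv
import wave
from typing import NamedTuple


class Record(NamedTuple):
    clip_id: str        # field 1: name of the corresponding .wav file
    transcription: str  # field 2: words spoken by the reader
    normalized: str     # field 3: numbers, ordinals, monetary units spelled out


def read_metadata(path):
    """Parse a pipe-delimited metadata file into a list of Record tuples."""
    records = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE (assumption): quotation marks are ordinary characters.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
            records.append(Record(*row))
    return records


def wav_duration_seconds(path):
    """Verify mono, 16-bit, 22,050 Hz audio and return the duration in seconds."""
    with wave.open(str(path), "rb") as w:
        fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        if fmt != (1, 2, 22050):  # sample width is in bytes: 2 bytes = 16 bits
            raise ValueError(f"unexpected audio format {fmt} in {path}")
        return w.getnframes() / w.getframerate()
```

Calling read_metadata on the metadata file yields one Record per clip. Summing wav_duration_seconds over all clips would be one way to reproduce the total duration reported in the statistics.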

Intended Use

Text-to-speech (TTS) synthesis model training and evaluation
Neural speech synthesis benchmarking
Voice modeling, prosody, and pronunciation research
Educational and prototyping use cases for speech generation

Known Limitations & Notes

Audio was originally encoded as 128 kbps MP3 by LibriVox; some clips may contain MP3 compression artifacts.
Clip boundaries are based on silence detection and do not always align precisely with sentence boundaries.
19 transcriptions contain non-ASCII characters (e.g., raison d'être).
The transcriptions contain a set of common abbreviations (e.g., Mr., Dr., St.) for which expansions are recommended. There is no standard expansion for Mrs.
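
Two of these notes lend themselves to a small preprocessing sketch: flagging transcriptions with non-ASCII characters, and expanding abbreviations before training. The Python below is illustrative only. The ABBREVIATIONS table covers just the three abbreviations named above, and its expansions are assumptions made for this example, not the dataset's full recommended list. Mrs. is deliberately left untouched.

```python
import re

# Illustrative subset with assumed expansions; not the full recommended list.
# "Mrs." is omitted on purpose because it has no standard expansion.
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor", "St.": "Saint"}

# Match an abbreviation at a word start, followed by whitespace or end of text.
_ABBREV_RE = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")(?=\s|$)"
)


def expand_abbreviations(text):
    """Replace each known abbreviation with its expansion."""
    return _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def has_non_ascii(text):
    """Return True if the text contains any character outside ASCII."""
    return not text.isascii()


print(expand_abbreviations("Mr. Smith met Dr. Jones."))  # Mister Smith met Doctor Jones.
print(has_non_ascii("raison d'être"))                    # True
```

Whether to transliterate or keep the flagged non-ASCII transcriptions depends on the character set the downstream model supports.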

License

This dataset is in the public domain in the United States, and most likely in other countries as well. There are no restrictions on its use.

"All text, audio, and annotations are in the public domain. We request that you use this dataset for good and not evil."

For more information, see: https://librivox.org/pages/public-domain

Creators

Role	Person
Audio Recordings	Linda Johnson (via LibriVox)
Alignment & Annotation	Keith Ito

Citation

There is no requirement to cite this work (as it is in the public domain), but if you wish to cite it in a publication, the authors suggest the following:

@misc{ljspeech17,
  author       = {Keith Ito and Linda Johnson},
  title        = {The LJ Speech Dataset},
  howpublished = {\url{https://keithito.com/LJ-Speech-Dataset/}},
  year         = {2017}
}

Alternatively, you may link directly to: https://keithito.com/LJ-Speech-Dataset/

Changelog

Version	Notes
1.1 (current)	Removed 30 `.wav` files that were present in v1.0 without corresponding annotations in `metadata.csv`. (Bug reported by Rafael Valle.)
1.0	Initial release
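
The issue fixed in version 1.1 is the kind of mismatch a short script can detect. The sketch below lists .wav files that have no matching metadata entry. The function name find_unannotated_wavs is hypothetical, and the code assumes that metadata IDs equal the .wav file names without the extension.

```python
from pathlib import Path


def find_unannotated_wavs(wav_dir, annotated_ids):
    """Return the sorted stems of .wav files that have no metadata entry."""
    annotated = set(annotated_ids)
    return sorted(
        p.stem for p in Path(wav_dir).glob("*.wav") if p.stem not in annotated
    )
```

An empty result means that every audio file in the directory has an annotation.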
