License: CC0-1.0
Steward: MDC Community Concierge
Task: TTS
Release Date: 5/1/2026
Format: WAV, TSV
Size: 3.00 GB
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
Restrictions/Special Constraints
N/A
Forbidden Usage
N/A
Intended Use
Text-to-speech (TTS) synthesis model training and evaluation; Neural speech synthesis benchmarking; Voice modeling, prosody, and pronunciation research; Educational and prototyping use cases for speech generation
| Field | Details |
|---|---|
| Name | The LJ Speech Dataset |
| Version | 1.1 (current) |
| Released | 2017 |
| Homepage | https://keithito.com/LJ-Speech-Dataset/ |
| License | Public Domain |
| Download Size | 2.6 GB |
The LJ Speech Dataset is a public domain speech dataset of 13,100 short audio clips of a single female speaker reading passages from 7 non-fiction books. A text transcription is provided for each clip. Clips range from approximately 1 to 10 seconds in length, with a total duration of roughly 24 hours.
The dataset is widely used for training and benchmarking text-to-speech (TTS) synthesis systems. Its high audio quality, clean segmentation, and aligned transcripts make it a standard benchmark in neural speech synthesis research.
| Field | Details |
|---|---|
| Language | English (en) |
| Script | Latin |
| Dialect | American English |
All transcriptions and audio recordings are in English.
The audio was recorded in 2016–2017 by Linda Johnson as part of the LibriVox project. LibriVox is a volunteer-run initiative that produces free public domain audiobooks. The original LibriVox recordings were distributed as 128 kbps MP3 files, and as a result some clips may contain minor artifacts introduced by MP3 encoding.
Audio clips were segmented automatically based on silences in the recording. Clip boundaries generally align with sentence or clause boundaries, though not always.
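The card does not say which tool or thresholds were used for the segmentation. As a rough illustration of the general idea only, the following sketch splits a sequence of amplitude samples wherever a run of quiet samples exceeds a minimum length. The function name, the amplitude threshold, and the minimum silence length are all assumptions made for this example, not the method actually used to build the dataset.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.02,       # assumed amplitude below which a sample is "quiet"
    min_silence_sec: float = 0.3,  # assumed minimum pause length that ends a clip
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices (end exclusive) of non-silent regions."""
    min_silence = max(1, round(min_silence_sec * sample_rate))
    segments: List[Tuple[int, int]] = []
    start = None   # index where the current segment began, or None
    quiet_run = 0  # number of consecutive quiet samples seen so far

    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence:
                # Close the segment at the first sample of the pause.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0

    if start is not None:
        # Audio ended mid-segment; trim any trailing quiet samples.
        segments.append((start, len(samples) - quiet_run))
    return segments
```

Because a splitter of this kind only sees pauses, it cuts wherever the reader happens to stop, which is consistent with clip boundaries that usually, but not always, fall on sentence or clause boundaries.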
Alignment and annotation were performed manually by Keith Ito. A quality assurance pass was done to ensure that text transcriptions accurately matched the spoken audio.
The dataset consists of excerpts from the following 7 public domain works (published 1884–1964):
| Title | Author | Year |
|---|---|---|
| The Chronicles of Newgate, Vol. 2 | Arthur Griffiths | 1884 |
| Arts and Crafts Essays | William Morris et al. | 1893 |
| Marion Harland's Cookery for Beginners | Marion Harland | 1893 |
| The Science-History of the Universe, Vol. 5: Biology | Francis Rolt-Wheeler | 1910 |
| The Seven Wonders of the Ancient World | Edgar J. Banks | 1916 |
| The Fireside Chats of Franklin Delano Roosevelt | Franklin D. Roosevelt | 1933–42 |
| Report of the President's Commission on the Assassination of President Kennedy | President's Commission | 1964 |
| Metric | Value |
|---|---|
| Total Clips | 13,100 |
| Total Words | 225,715 |
| Total Characters | 1,308,678 |
| Total Duration | 23 hr 55 min 17 sec |
| Mean Clip Duration | 6.57 sec |
| Min Clip Duration | 1.11 sec |
| Max Clip Duration | 10.10 sec |
| Mean Words per Clip | 17.23 |
| Distinct Words | 13,821 |
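
The two mean values in the table follow from the totals. A quick arithmetic check, using only the figures listed above:

```python
# Figures copied from the statistics table.
total_clips = 13_100
total_words = 225_715
total_seconds = 23 * 3600 + 55 * 60 + 17  # 23 hr 55 min 17 sec

mean_words = total_words / total_clips
mean_duration = total_seconds / total_clips

print(f"{mean_words:.2f} words per clip")       # 17.23
print(f"{mean_duration:.2f} seconds per clip")  # 6.57
```
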
Audio: Single-channel 16-bit PCM WAV files at a sample rate of 22,050 Hz
Metadata: metadata.csv (pipe-delimited, |), with three fields per record:
ID — name of the corresponding .wav file
Transcription — words spoken by the reader (UTF-8)
Normalized Transcription — transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8)
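A minimal sketch of reading this layout with the Python standard library is shown below. It assumes the file has no header row and that quote characters in the text are literal (hence `csv.QUOTE_NONE`); the function names and dictionary keys are made up for this example. The second helper checks a WAV header against the audio format stated above.

```python
import csv
import wave
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def load_metadata(path: PathLike) -> List[Dict[str, str]]:
    """Read pipe-delimited records of (ID, transcription, normalized transcription)."""
    records: List[Dict[str, str]] = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat any quote characters in the text as ordinary characters.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if len(row) != 3:
                raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
            clip_id, text, normalized = row
            records.append({"id": clip_id, "text": text, "normalized": normalized})
    return records


def wav_matches_spec(path: PathLike) -> bool:
    """True if the file is single-channel, 16-bit PCM at 22,050 Hz."""
    with wave.open(str(path), "rb") as w:
        return (
            w.getnchannels() == 1
            and w.getsampwidth() == 2      # 2 bytes per sample = 16 bits
            and w.getframerate() == 22050
        )
```

Calling `load_metadata("metadata.csv")` on the full file should, if the assumptions hold, return one record per clip; the path is whatever location the archive was extracted to.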
Audio was originally encoded as 128 kbps MP3 by LibriVox; some clips may contain MP3 compression artifacts.
Clip boundaries are based on silence detection and do not always align precisely with sentence boundaries.
19 transcriptions contain non-ASCII characters (e.g., raison d'être).
The transcriptions contain a standard set of abbreviations (e.g., Mr., Dr., St.), for which expansions are recommended. Note that there is no standard expansion for Mrs.
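The last two points usually matter during text preprocessing. The sketch below flags transcriptions with non-ASCII characters and expands the three abbreviations named above. The expansions in the mapping are assumptions chosen for illustration, since this card does not list them, and "Mrs." is deliberately left untouched because it has no standard expansion.

```python
import re

# Illustrative expansions only (assumed; not listed in this card).
# "St." is expanded as "Saint" here, though it can also mean "Street".
ABBREVIATIONS = {
    "Mr.": "Mister",
    "Dr.": "Doctor",
    "St.": "Saint",
}

# Match any listed abbreviation at the start of a word. "Mrs." does not match
# the "Mr." pattern because an "s", not a period, follows "Mr".
_ABBREV_RE = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its (assumed) expansion."""
    return _ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def has_non_ascii(text: str) -> bool:
    """True if the text contains any character outside the ASCII range."""
    return not text.isascii()
```

Whether to transliterate, keep, or drop the non-ASCII clips depends on the character set the downstream TTS front end supports.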
This dataset is in the public domain in the United States, and most likely in other countries as well. There are no restrictions on its use.
"All text, audio, and annotations are in the public domain. We request that you use this dataset for good and not evil."
For more information, see: https://librivox.org/pages/public-domain
| Role | Person |
|---|---|
| Audio Recordings | Linda Johnson (via LibriVox) |
| Alignment & Annotation | Keith Ito |
There is no requirement to cite this work (as it is in the public domain), but if you wish to cite it in a publication, the authors suggest the following:
@misc{ljspeech17,
  author       = {Keith Ito and Linda Johnson},
  title        = {The LJ Speech Dataset},
  howpublished = {\url{https://keithito.com/LJ-Speech-Dataset/}},
  year         = {2017}
}
Alternatively, you may link directly to: https://keithito.com/LJ-Speech-Dataset/
| Version | Notes |
|---|---|
| 1.1 (current) | Removed 30 .wav files that were present in v1.0 without corresponding annotations in metadata.csv. (Bug reported by Rafael Valle.) |
| 1.0 | Initial release |
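
The kind of mismatch fixed in v1.1 can be detected with a short check like the one below. It is a sketch that assumes each ID in metadata.csv equals the clip's file name without the `.wav` extension; the function name and the directory argument are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Union


def find_unannotated_wavs(
    wav_dir: Union[str, Path], annotated_ids: Iterable[str]
) -> List[str]:
    """Return names of .wav files in wav_dir whose stem is not an annotated ID."""
    ids = set(annotated_ids)
    return sorted(
        p.name for p in Path(wav_dir).glob("*.wav") if p.stem not in ids
    )
```

An empty result means every audio file in the directory has a matching annotation.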