Synthetic Speech Samples

Topline
These systems combine a conventional text-processing front-end with HMM-based waveform generation.

Language Natural Sample 1 Sample 2 Sample 3
English Listen! Listen! Listen! Listen!
Finnish Listen! Listen! Listen! Listen!
Romanian Listen! Listen! Listen! Listen!
Spanish Listen! Listen! Listen! Listen!

Baseline 
These systems have limited or no text normalisation capability, so they require clean, normalised text input. They use grapheme units with shallow contextual features derived directly from the text, such as left and right letter context, word boundaries, sentence boundaries and punctuation features. They use no expert-specified categories of letters or words, such as phonetic categories (vowel, nasal, approximant, etc.) or part-of-speech categories (noun, verb, adjective, etc.). Instead, they use Vector Space Model features, which are designed to stand in for such expert knowledge but are derived fully automatically from distributional analysis of a large text corpus.
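
As a rough illustration of the shallow features described above, the following Python sketch derives per-letter context features from normalised text and then counts, for each letter, which letters occur to its left and right. It is not the implementation used to build these voices: the function names, the feature names and the "#" word-boundary symbol are all hypothetical, and a real vector space model would typically reduce the raw counts to a small number of dimensions, a step omitted here.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?"   # assumed punctuation set, for illustration only
BOUNDARY = "#"           # hypothetical symbol marking a word boundary


def grapheme_contexts(sentence):
    """Return one shallow feature dict per letter of a normalised sentence."""
    words = sentence.split()
    features = []
    for w_idx, token in enumerate(words):
        # Trailing punctuation becomes a feature of the word, not a unit itself.
        letters = token.rstrip(PUNCTUATION)
        punct = token[len(letters):] or None
        for l_idx, letter in enumerate(letters):
            last = len(letters) - 1
            features.append({
                "letter": letter,
                "left": letters[l_idx - 1] if l_idx > 0 else BOUNDARY,
                "right": letters[l_idx + 1] if l_idx < last else BOUNDARY,
                "word_initial": l_idx == 0,
                "word_final": l_idx == last,
                "in_first_word": w_idx == 0,
                "in_last_word": w_idx == len(words) - 1,
                "following_punct": punct,
            })
    return features


def letter_cooccurrence(corpus):
    """Count left and right neighbours of each letter over a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for feat in grapheme_contexts(sentence):
            counts[feat["letter"]]["L:" + feat["left"]] += 1
            counts[feat["letter"]]["R:" + feat["right"]] += 1
    return counts


if __name__ == "__main__":
    for feat in grapheme_contexts("hi you."):
        print(feat)
    print(dict(letter_cooccurrence(["hi you."])["o"]))
```

Letters that appear in similar neighbourhoods end up with similar count vectors, which is how such features can stand in for categories like "vowel" without anyone specifying them by hand.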

Language Natural Sample 1 Sample 2 Sample 3
English Listen! Listen! Listen! Listen!
Finnish Listen! Listen! Listen! Listen!
Romanian Listen! Listen! Listen! Listen!
Spanish Listen! Listen! Listen! Listen!
Indic Languages — IIIT-H corpora
Hindi Listen! Listen! Listen! Listen!
Kannada Listen! Listen! Listen! Listen!
Malayalam Listen! Listen! Listen! Listen!
Tamil Listen! Listen! Listen! Listen!
Telugu Listen! Listen! Listen! Listen!

Audiobook baseline
These systems use unconventional speech data (e.g. audiobooks, media interviews or speeches, podcasts), which was automatically segmented using a grapheme-based method. For Romanian, the text processing is the same as for the baseline systems. For the English audiobook, a topline front-end was used.

Language Natural Sample 1 Sample 2 Sample 3
English Listen! Listen! Listen! Listen!
Romanian Listen! Listen! Listen! Listen!

 

Emotion Transplantation
This technique transplants the emotional speaking style of a source speaker onto any target speaker while preserving the target speaker's identity.

SPEAKERS NEUTRAL ANGRY HAPPY SAD SURPRISED
JOA (source speaker, natural speech) 001 004 007 010 013
UVD 016 028 036 031 040
JLC 019 051 053 055 057
JEC 022 066 068 069 071

Blizzard 2014 Indian language voices 
These samples are from systems submitted to the Blizzard Challenge 2014. In addition to the unsupervised vector space model representations employed in our baseline systems, these systems benefit from several innovations: naive alphabetisation, unsupervised syllabification, and glottal flow pulse prediction using deep neural networks. For full details, see:

A. Suni, T. Raitio, D. Gowda, R. Karhila, M. Gibson, and O. Watts. The Simple4All entry to the Blizzard Challenge 2014. In Proc. of the Blizzard Challenge 2014 Workshop, Singapore, September 2014.

Language Natural Sample 1 Sample 2
Assamese Listen! Listen! Listen!
Gujarati Listen! Listen! Listen!
Hindi Listen! Listen! Listen!
Rajasthani Listen! Listen! Listen!
Tamil Listen! Listen! Listen!
Telugu Listen! Listen! Listen!