Topline
These systems pair a conventional text-processing front-end with HMM-based waveform generation.
Language | Natural | Sample 1 | Sample 2 | Sample 3 |
---|---|---|---|---|
English | Listen! | Listen! | Listen! | Listen! |
Finnish | Listen! | Listen! | Listen! | Listen! |
Romanian | Listen! | Listen! | Listen! | Listen! |
Spanish | Listen! | Listen! | Listen! | Listen! |
Baseline
These systems have limited or no text normalisation capability, so they require clean, normalised text input. They use grapheme units with shallow contextual features derived directly from the text, such as left and right letter context, word boundaries, sentence boundaries, and punctuation features. They use no expert-specified categories of letters or words, such as phonetic categories (vowel, nasal, approximant, etc.) or part-of-speech categories (noun, verb, adjective, etc.). Instead, they use Vector Space Model features, which are designed to stand in for such expert knowledge but are derived fully automatically from distributional analysis of a large text corpus.
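As a rough illustration of the shallow, text-derived features listed above, the following sketch computes one feature record per letter of a clean sentence. It is not the actual front-end of these systems: the function name `grapheme_features`, the feature names, and the padding symbols `<s>` and `</s>` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple, Union

# Punctuation marks treated as word-final punctuation (an assumption).
PUNCT = ".,;:!?"


def grapheme_features(sentence: str) -> List[Dict[str, Union[str, bool]]]:
    """Return one shallow feature record per letter of a normalised sentence."""
    # Split into (word, trailing punctuation) pairs.
    words: List[Tuple[str, str]] = []
    for token in sentence.split():
        core = token.rstrip(PUNCT)
        punct = token[len(core):]
        if core:
            words.append((core.lower(), punct))

    # Flatten to (letter, word index, position in word).
    letters = [
        (ch, wi, pi)
        for wi, (word, _) in enumerate(words)
        for pi, ch in enumerate(word)
    ]

    features = []
    for i, (ch, wi, pi) in enumerate(letters):
        word, punct = words[wi]
        features.append({
            "letter": ch,
            # Left and right letter context, padded at sentence edges.
            "left": letters[i - 1][0] if i > 0 else "<s>",
            "right": letters[i + 1][0] if i + 1 < len(letters) else "</s>",
            # Word and sentence boundary flags.
            "word_initial": pi == 0,
            "word_final": pi == len(word) - 1,
            "sentence_initial": i == 0,
            "sentence_final": i == len(letters) - 1,
            # Punctuation attached to the end of the containing word.
            "punct_after_word": punct,
        })
    return features


if __name__ == "__main__":
    for record in grapheme_features("Hi, you."):
        print(record)
```

Everything here is read directly off the character string, so no pronunciation lexicon or language-specific rule is needed; that is the property the baseline description relies on.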
Language | Natural | Sample 1 | Sample 2 | Sample 3 |
---|---|---|---|---|
English | Listen! | Listen! | Listen! | Listen! |
Finnish | Listen! | Listen! | Listen! | Listen! |
Romanian | Listen! | Listen! | Listen! | Listen! |
Spanish | Listen! | Listen! | Listen! | Listen! |
Indic Languages (IIIT-H corpora) | | | | |
Hindi | Listen! | Listen! | Listen! | Listen! |
Kannada | Listen! | Listen! | Listen! | Listen! |
Malayalam | Listen! | Listen! | Listen! | Listen! |
Tamil | Listen! | Listen! | Listen! | Listen! |
Telugu | Listen! | Listen! | Listen! | Listen! |
Audiobook baseline
These systems use unconventional speech data (e.g. audiobooks, media interviews or speeches, podcasts), which were automatically segmented using a grapheme-based method. For Romanian, the text processing is the same as for the baseline systems. For the English audiobook, a topline front-end was used.
Language | Natural | Sample 1 | Sample 2 | Sample 3 |
---|---|---|---|---|
English | Listen! | Listen! | Listen! | Listen! |
Romanian | Listen! | Listen! | Listen! | Listen! |
Emotion Transplantation
This technique transplants the emotional speaking style of a single source speaker onto any other target speaker, while preserving the identity of the target speaker.
Speaker | Neutral | Angry | Happy | Sad | Surprised |
---|---|---|---|---|---|
JOA (source speaker, natural speech) | 001 | 004 | 007 | 010 | 013 |
UVD | 016 | 028 | 036 | 031 | 040 |
JLC | 019 | 051 | 053 | 055 | 057 |
JEC | 022 | 066 | 068 | 069 | 071 |
Blizzard 2014 Indian language voices
These samples are from systems submitted to the Blizzard Challenge 2014. As well as the unsupervised vector space model representations employed in our baseline systems, these systems benefit from several innovations: naive alphabetisation, unsupervised syllabification, and glottal flow pulse prediction using deep neural networks. For full details, see:
A. Suni, T. Raitio, D. Gowda, R. Karhila, M. Gibson, and O. Watts. The Simple4All entry to the Blizzard Challenge 2014. In Proc. of the Blizzard Challenge 2014 Workshop, Singapore, September 2014.
Language | Natural | Sample 1 | Sample 2 |
---|---|---|---|
Assamese | Listen! | Listen! | Listen! |
Gujarati | Listen! | Listen! | Listen! |
Hindi | Listen! | Listen! | Listen! |
Rajasthani | Listen! | Listen! | Listen! |
Tamil | Listen! | Listen! | Listen! |
Telugu | Listen! | Listen! | Listen! |
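The unsupervised vector space model representations mentioned above come from distributional analysis of text. As a toy sketch of that general idea only, the code below builds a context-count vector for each letter from its immediate left and right neighbours and compares letters by cosine similarity. The function names, the `#` word-boundary symbol, and the one-letter context window are assumptions for this example; a dimensionality-reduction step would normally follow and is omitted here, and nothing below is taken from the submitted systems.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable


def letter_vectors(corpus: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Map each letter to a normalised distribution over its neighbours."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in corpus:
        for word in line.lower().split():
            padded = "#" + word + "#"  # '#' marks a word boundary
            for i in range(1, len(padded) - 1):
                counts[padded[i]]["L:" + padded[i - 1]] += 1
                counts[padded[i]]["R:" + padded[i + 1]] += 1

    vectors = {}
    for letter, contexts in counts.items():
        total = sum(contexts.values())
        vectors[letter] = {c: n / total for c, n in contexts.items()}
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    vecs = letter_vectors(["ba da"])
    # 'b' and 'd' occur in identical contexts in this tiny corpus.
    print(cosine(vecs["b"], vecs["d"]))
    # 'a' shares no context with 'b' here.
    print(cosine(vecs["a"], vecs["b"]))
```

Letters that appear in similar contexts end up with similar vectors, which is how such features can stand in for expert-specified categories without any labelled data.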