Antti Suni, Reima Karhila, Tuomo Raitio, Mikko Kurimo, Martti Vainio, Paavo Alku (2013): Lombard Modified Text-to-Speech Synthesis for Improved Intelligibility: Submission for the Hurricane Challenge 2013. In: Proc. Interspeech 2013, 2013.
@inproceedings{Suni_IS13,
title = {Lombard Modified Text-to-Speech Synthesis for Improved Intelligibility: Submission for the Hurricane Challenge 2013},
author = {Antti Suni and Reima Karhila and Tuomo Raitio and Mikko Kurimo and Martti Vainio and Paavo Alku},
url = {http://consortium.simple4all.org/files/2013/11/suni13a.pdf},
year = {2013},
date = {2013-04-25},
booktitle = {Proc. Interspeech 2013},
abstract = {This paper describes modification of a TTS system for improving the intelligibility of speech in various noise conditions. First, the GlottHMM vocoder is used for training a voice with modal speech data. The vocoder and voice parameters are then modified to mimic the properties of Lombard effect based on a small amount of Lombard speech from the same speaker. More specifically, the durations are increased, fundamental frequency is raised, spectral tilt is decreased, the harmonic-to-noise ratio is increased, and pressed glottal flow pulses are used in creating excitation. The formants of the speech are also enhanced and finally the speech is compressed in order to increase noise robustness of the voice. The evaluation results of the Hurricane Challenge 2013 indicate that the modified voice is mostly less intelligible than the unmodified natural speech, as expected, but more intelligible than the reference TTS voice, especially in the low SNR conditions.},
keywords = {GlottHMM, Hurricane challenge, intelligibility, Lombard speech, Speech synthesis}
}
Tuomo Raitio, Antti Suni, Martti Vainio, Paavo Alku (2013): Synthesis and Perception of Breathy, Normal, and Lombard Speech in the Presence of Noise. In: Special issue of Computer Speech and Language on 'The Listening Talker', 2013.
@article{Raitio13b,
title = {Synthesis and Perception of Breathy, Normal, and Lombard Speech in the Presence of Noise},
author = {Tuomo Raitio and Antti Suni and Martti Vainio and Paavo Alku},
url = {http://dx.doi.org/10.1016/j.csl.2013.03.003},
year = {2013},
date = {2013-01-14},
journal = {Special issue of Computer Speech and Language on 'The Listening Talker'},
abstract = {This paper studies the synthesis of speech on a wide vocal effort continuum and its perception in the presence of noise. Three types of speech are recorded and studied along the continuum: breathy, normal, and Lombard speech. Corresponding synthetic voices are created by training and adapting the statistical parametric speech synthesis system GlottHMM. Natural and synthetic speech along the continuum is assessed in listening tests that evaluate the intelligibility, quality, and suitability of speech in three different realistic multichannel noise conditions: silence, moderate street noise, and extreme street noise. The evaluation results are encouraging in showing that the synthesized voices with varying vocal effort are rated similarly to their natural counterparts both in terms of intelligibility and suitability.},
keywords = {Adaptation, Breathy speech, intelligibility, Lombard speech, statistical parametric speech synthesis, Vocal effort}
}
Tuomo Raitio, Marko Takanen, Olli Santala, Antti Suni, Martti Vainio, Paavo Alku (2012): On measuring the intelligibility of synthetic speech in noise – Do we need a realistic noise environment? In: Proc. ICASSP 2012, pp. 4025-4028, IEEE, 2012, ISSN: 1520-6149.
@inproceedings{Raitio_et_al_icassp2012,
title = {On measuring the intelligibility of synthetic speech in noise – Do we need a realistic noise environment?},
author = {Tuomo Raitio and Marko Takanen and Olli Santala and Antti Suni and Martti Vainio and Paavo Alku},
url = {http://dx.doi.org/10.1109/ICASSP.2012.6288801},
issn = {1520-6149},
year = {2012},
date = {2012-10-12},
booktitle = {Proc. ICASSP 2012},
pages = {4025--4028},
publisher = {IEEE},
abstract = {Assessing the intelligibility of synthetic speech is important in creating synthetic voices to be used in real-life applications, especially for the ones involving interfering noise. This raises the question of how to measure the intelligibility of synthetic speech to correctly simulate such conditions. Conventionally, this has been done using a simple listening test setup where diotic speech and noise are played to both ears with headphones. This is indeed very different from the real noise environment where speech and noise are spatially distributed. This paper addresses the question of whether a realistic noise environment should be used to test the intelligibility of synthetic speech. Three different test conditions, one with multichannel reproduction of noise and speech and two headphone setups, are evaluated. Tests are performed with natural and synthetic speech, including speech especially intended for noisy conditions. The results indicate a general trend in all setups but also some interesting differences.},
keywords = {intelligibility, Lombard speech, multichannel reproduction, speech in noise}
}