Text-to-Speech (TTS) synthesis, i.e., artificially produced speech, has finally attained a level of quality that allows it to be integrated into everyday services used by the general public. With the increasing processing power of smartphones and the development of intelligent personal assistants like Siri, Cortana, and Google Now, synthetic speech has begun to reach even more people. Within the past couple of years, TTS has thus made its way from a geeky accessory to a normal part of everyday life.
Nonetheless, modern TTS systems still suffer from diverse quality constraints: frequent concatenations and temporal manipulations in diphone synthesis cause discontinuous speech, HMM synthesis can lead to natural-sounding but also very buzzy and muffled speech, and the quality of unit selection voices depends not only on how well the selected units fit together, but also on the appropriateness of the available speech units. Each of these impairments yields a different perceptual impression, so the quality of synthetic speech is multidimensional in nature.
Therefore, research on perceptual quality dimensions of synthetic speech is reviewed, and two experiments on perceptual quality are conducted. Their findings are compared with the state of the art, and a set of five perceptual quality dimensions is derived: (i) naturalness of voice, (ii) prosodic quality, (iii) fluency and intelligibility, (iv) absence of disturbances, and (v) calmness. Moreover, a test protocol is designed that recommends an experimental setup for assessing these five dimensions.
In addition, several factors that influence these dimensions are analyzed. First, the findings of two studies show that the relevance of these dimensions shifts depending on the use case (short message readers vs. synthesized audiobooks). Second, the voice of the speaker who recorded the speech corpus is shown to have a significant effect on all dimensions. Third, it is shown that the size of the speech corpus for unit selection voices significantly affects all dimensions.
Furthermore, different approaches to instrumental quality assessment of synthetic speech are examined. Two linear regression models are developed and employed to estimate the quality of TTS signals. Even though they reach correlations of up to .74 between estimated score and auditory rating, they are outperformed by two more complex, non-linear approaches. One of these non-linear measures is applied with the aim of improving the quality of MaryTTS unit selection voices. Even though this goal could not be achieved, the study highlights different approaches to further improve the prediction accuracy and therefore also the quality of the generated voice.
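To illustrate what a correlation between estimated score and auditory rating means, the following is a minimal sketch and not the models developed in the book. It assumes a single hypothetical signal feature per stimulus and a handful of made-up listener ratings, fits a one-variable least-squares line, and computes the Pearson correlation between the estimates and the ratings. The names `fit_line`, `pearson`, `feature`, and `rating` are illustrative assumptions, and the numbers are invented.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


# Invented illustration data (NOT from the book): one hypothetical
# distortion feature per stimulus and the mean listener rating it received.
feature = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3]
rating = [4.1, 3.8, 3.9, 3.0, 2.6, 2.4]

slope, intercept = fit_line(feature, rating)
estimated = [slope * f + intercept for f in feature]

# Correlation between estimated score and auditory rating.
r = pearson(estimated, rating)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

In this sketch the correlation is computed on the same data used for fitting, which tends to overstate accuracy; a realistic evaluation would presumably use held-out stimuli and more than one feature.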