Producing auditory content – what text to speech software can do

Alexa, Siri and Polly? We have known these ladies’ voices for years, and some people gave the voice in their car’s navigation system a name many years ago. Text to Speech (T2S) has become an integral part of our everyday life: in public transportation, on telephone hotlines and in apps, written texts are increasingly voiced by software rather than by actual humans.

Nowadays, corporate communication takes place across a range of media: videos, interactive training courses, factory tours, trade shows and many more. In such projects, content needs to be not only multimedia but also multilingual! Text and sound should be available in all of the company’s target languages, at high quality, and produced quickly and efficiently. Depending on the project’s scope, the number of target markets and how often the company requires similar content, T2S offers an economical alternative to “traditional” voice recording by human speakers, so-called “talents”. Auditory content also offers advantages over subtitling, or it can be produced in addition to subtitles.

Requirements for Text to Speech – good preparation makes the difference

By default, text to speech software can correctly “voice” most of the words in a text. Depending on the software solution, several voices may be available per language. It is worth investing some time in testing which voice best represents the company’s content. Because the software relies on the basic pronunciation rules of the respective language, some terms must be adapted manually by a linguist.

Abbreviations, technical terms, loanwords and proper nouns, among others, belong to this category. So that the software knows how to pronounce these exceptions, the text is converted into phonetic spelling. In this mode, the individual phonemes of a word can be adjusted until the term sounds correct. So that the same terms do not have to be adapted again in the future, a pronunciation dictionary is created for each company and language, and new terms are added to it with every project. The manual follow-up effort after the automated conversion from text to phonetic spelling therefore shrinks with every project, even though it initially requires attention. In addition, the intonation of words within the sentence may need some fine-tuning. Examples of such adjustments are pauses, the speech rate, and raising or lowering the voice at the end of a sentence.
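The preparation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow of any particular product: it assumes an engine that accepts SSML (the W3C Speech Synthesis Markup Language), and the lexicon entries, their IPA transcriptions and the function name to_ssml are hypothetical examples chosen for this sketch.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical company lexicon: term -> IPA transcription (illustrative values only).
LEXICON = {
    "tsd": "ˌtiː.ɛsˈdiː",
    "T2S": "ˌtiː.tuːˈɛs",
}

def to_ssml(text, lexicon, rate="medium", pause_ms=300):
    """Wrap lexicon terms in <phoneme> tags and add a pause after each sentence."""
    if lexicon:
        # Longest terms first so that longer entries win over shorter ones.
        terms = sorted(lexicon, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    else:
        pattern = re.compile(r"(?!)")  # never matches

    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    parts = []
    for sentence in sentences:
        pieces, last = [], 0
        for match in pattern.finditer(sentence):
            term = match.group(1)
            pieces.append(escape(sentence[last:match.start()]))
            pieces.append(
                f"<phoneme alphabet=\"ipa\" ph={quoteattr(lexicon[term])}>"
                f"{escape(term)}</phoneme>"
            )
            last = match.end()
        pieces.append(escape(sentence[last:]))
        parts.append("".join(pieces) + f'<break time="{pause_ms}ms"/>')

    return f'<speak><prosody rate="{rate}">{"".join(parts)}</prosody></speak>'

if __name__ == "__main__":
    print(to_ssml("Contact tsd today. T2S saves time.", LEXICON, rate="slow"))
```

The lexicon plays the role of the pronunciation dictionary: every term added once is reused in all later projects, while the rate and pause_ms parameters stand in for the intonation fine-tuning. Real engines differ in which SSML elements and phonetic alphabets they support, so the generated markup would need to be checked against the chosen software.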

Following this preparation, the actual audio file can be produced, which takes only a few minutes. The ratio of preparation time to production time for a text to speech recording is therefore the reverse of the ratio when working with human voices. A human voice-over artist needs at least as much time for recording as the length of the video, excluding corrections. Text to speech software needs extensive preparation, but the audio itself is created in a fraction of the original running time.


Text to speech solutions not only save time and money; they also allow companies to produce much more translated audio content, in more languages, within the same budget. Moreover, consistency and recognition value are much easier to achieve with a software solution than with voice-over artists, as the software’s voices are always available and can be used in projects at short notice.

Hence, videos can always be produced with the same voice, without having to consider aspects such as resource planning. By using such software solutions for recording, the quality of the audio files remains consistent. Possible disruptive elements such as background noises can be ruled out.


Even with careful preparation, the results of text to speech are still distinguishable from recordings of human voices; however, the quality is comparable. In some languages, synthetic voices sound more “natural” than in others, but in general a text to speech process is suitable for all languages. For customer-specific and project-specific terms to be conveyed comprehensibly and fluently in every language, thorough preparation of content that can be reused across projects is essential. Under these conditions, text to speech offers an attractive alternative to the traditional creation of audio material.

Would you like to learn more about text to speech processes at tsd? Please do not hesitate to contact us!