Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control

Bibliographic details
Title: Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control
Authors: Damien Lolive, Aghilas Sini, Elisabeth Delais-Roussarie, Sébastien Le Maguer
Contributors: Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Bretagne Sud (UBS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-CentraleSupélec-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT), Expressiveness in Human Centered Data/Media (EXPRESSION), MEDIA ET INTERACTIONS (IRISA-D6), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Bretagne Sud (UBS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), ADAPT Centre, Sigmedia Lab, EE Engineering, Trinity College Dublin, Laboratoire de Linguistique de Nantes (LLING), Centre National de la Recherche Scientifique (CNRS)-Université de Nantes - UFR Lettres et Langages (UFRLL), Université de Nantes (UN)-Université de Nantes (UN), Université de Rennes (UR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes)
Publication details: HAL CCSD, 2020.
Publication year: 2020
Subject terms: Computer science, Feature vector, Speech recognition, Multi-speaker TTS, Speech synthesis, 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], External Data Representation, computer.software_genre, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Deep Learning, 0202 electrical engineering, electronic engineering, information engineering, Expressive TTS, [INFO]Computer Science [cs], Artificial neural network, business.industry, Deep learning, 020206 networking & telecommunications, Speaker recognition, Identifier, Speaker control, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Coding (social sciences)
Description: International audience; To gain more control over Text-to-Speech (TTS) synthesis and to improve expressivity, it is necessary to disentangle the prosodic information carried by the speaker's voice identity from that belonging to linguistic properties. In this paper, we propose to analyze how information related to speaker voice identity affects a Deep Neural Network (DNN) based multi-speaker speech synthesis model. To do so, we feed the network with a vector encoding speaker information in addition to a set of basic linguistic features. We then compare three main speaker coding configurations: a) a simple one-hot vector describing the speaker gender and identifier; b) an embedding vector extracted from a pre-trained speaker recognition model; c) a prosodic vector which summarizes information such as melody, intensity, and duration. To measure the impact of the input feature vector, we investigate the representation of the latent space at the output of the first layer of the network. The aim is to have an overview of our data representation and model behavior. Furthermore, we conducted a subjective assessment to validate the results. Results show that the prosodic identity of the speaker is captured by the model and therefore allows the user to control synthesis more precisely.
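Illustrative note: the speaker coding configurations named in the description can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. All function names, the two-gender layout, and the choice of mean and standard deviation as the prosodic summary are assumptions made for illustration; the record only states that the prosodic vector "summarizes information such as melody, intensity, and duration". Configuration b) is omitted because it depends on a specific pre-trained speaker recognition model.

```python
import numpy as np


def one_hot_speaker_code(speaker_index, num_speakers, gender_index, num_genders=2):
    """Configuration a): one-hot gender followed by one-hot speaker identifier.

    The layout (gender first, then identifier) is an assumption.
    """
    code = np.zeros(num_genders + num_speakers, dtype=np.float32)
    code[gender_index] = 1.0
    code[num_genders + speaker_index] = 1.0
    return code


def prosodic_code(f0_hz, intensity_db, durations_s):
    """Configuration c): a hypothetical prosodic summary vector.

    Uses the mean and standard deviation of F0 (melody), intensity,
    and segment duration; the actual statistics are not given in the record.
    """
    stats = []
    for stream in (f0_hz, intensity_db, durations_s):
        values = np.asarray(stream, dtype=np.float32)
        stats.extend([values.mean(), values.std()])
    return np.array(stats, dtype=np.float32)


def build_network_input(linguistic_features, speaker_code):
    """Append the same speaker code to every frame of linguistic features."""
    frames = np.asarray(linguistic_features, dtype=np.float32)
    tiled = np.tile(speaker_code, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)
```

With 4 speakers and 5 linguistic features per frame, `build_network_input(frames, one_hot_speaker_code(2, 4, 1))` yields 11 columns per frame: the 5 linguistic features followed by the 6-dimensional speaker code.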
Language: English
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::420c0d73a4fe35c0d971c1a2b14dc2b3
https://hal.archives-ouvertes.fr/hal-03000148/document
Rights: OPEN
Accession number: edsair.doi.dedup.....420c0d73a4fe35c0d971c1a2b14dc2b3
Database: OpenAIRE