Improving Few-Shot Multi-Speaker Text-to-Speech Adaptive-Based with Extracting Mel-Vector (EMV) for Vietnamese

Bibliographic Details
Title: Improving Few-Shot Multi-Speaker Text-to-Speech Adaptive-Based with Extracting Mel-Vector (EMV) for Vietnamese
Authors: Ngoc, Phuong Pham; Quang, Chung Tran; Chi, Mai Luong
Source: International Journal of Asian Language Processing; September 2022, Vol. 32, Issue 2-3
Abstract: Training a multi-speaker Text-to-Speech (TTS) model requires the voices of multiple speakers to build an average speech model. However, if a new speaker's voice provides too little training data, this average model becomes distorted, resulting in low-quality synthesis. Existing methods require fine-tuning the model; without fine-tuning, adaptive quality is low, yet achieving high adaptive quality demands at least thousands of fine-tuning steps. To address these issues, we propose in this paper an adaptation-based Vietnamese multi-speaker TTS technique that synthesizes high-quality speech and adapts effectively to new speakers, with two main improvements: (1) an Extracting Mel-Vector (EMV) architecture with three components (Encoder, Decoder, and Embedding Features) that enables complete learning of speaker features from Mel-spectrogram input for few-shot training, and (2) a continuous-learning technique called "data-distributing" that preserves the new speaker's characteristics after many training epochs. Our proposed model outperformed the baseline multi-speaker synthesis model, achieving a MOS of 3.8/4.6 and a SIM of 2.6/4 with only 1 min of the target speaker's voice.
Database: Supplemental Index
Description
ISSN: 2717-5545; 2424-791X
DOI: 10.1142/S2717554523500042