Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement

Bibliographic Details
Title: Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement
Authors: Zhang, Shucong, Chadwick, Malcolm, Ramos, Alberto Gil C. P., Bhattacharya, Sourav
Publication Year: 2022
Collection: Computer Science
Subject Terms: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Description: Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach outperforms competitive baselines consistently, even when our model is only approximately half the size.
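Illustrative sketch: the following is a minimal, hypothetical example (not the authors' implementation) of the conditioning idea the abstract describes. Rather than fusing a single static speaker embedding into the enhancement model, cross-attention lets each frame of the noisy input query the enrolment-audio frames, producing a target-speaker representation that adapts per frame. All module names, dimensions, and the use of PyTorch's nn.MultiheadAttention are assumptions made purely for illustration.

import torch
import torch.nn as nn


class CrossAttentionSpeakerConditioning(nn.Module):
    """Attend from noisy-mixture frames (queries) to enrolment frames (keys/values),
    yielding a per-frame, adaptive target-speaker representation instead of a
    single static embedding vector. Hypothetical sketch, not the paper's model."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mix_frames: torch.Tensor, enrol_frames: torch.Tensor) -> torch.Tensor:
        # mix_frames:   (batch, T_mix, d_model)   features of the current noisy chunk
        # enrol_frames: (batch, T_enrol, d_model) features of the enrolment audio
        adaptive_spk, _ = self.cross_attn(query=mix_frames,
                                          key=enrol_frames,
                                          value=enrol_frames)
        # Fuse the adaptive speaker representation back into the mixture features.
        return self.norm(mix_frames + adaptive_spk)


if __name__ == "__main__":
    layer = CrossAttentionSpeakerConditioning()
    mix = torch.randn(2, 50, 256)     # 50 streaming frames of the noisy mixture
    enrol = torch.randn(2, 120, 256)  # 120 frames of enrolment audio
    print(layer(mix, enrol).shape)    # torch.Size([2, 50, 256])

In a streaming setting, a layer like this would be applied chunk by chunk to the incoming audio, with the enrolment features computed once and reused, which is one plausible reading of how an adaptive representation can be obtained without re-processing the enrolment clip in real time.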
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2211.04346
Accession Number: edsarx.2211.04346
Database: arXiv