Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge

Bibliographic Details
Title: Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge
Authors: Yeow, Jun Wei; Tan, Ee-Leng; Bai, Jisheng; Peksi, Santi; Gan, Woon-Seng
Publication Year: 2024
Subject Terms: Electrical Engineering and Systems Science - Audio and Speech Processing
Description: This technical report details our systems submitted for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). We address only the audio-only SELD with SDE (SELDDE) task in this report. We propose to improve the existing ResNet-Conformer architectures with Squeeze-and-Excitation blocks to introduce additional forms of channel- and spatial-wise attention. To improve SELD performance, we also use Spatial Cue-Augmented Log-Spectrogram (SALSA) features in place of the commonly used log-mel spectral features for polyphonic SELD. We complement the existing Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset with the audio channel swapping technique and synthesize additional data using the SpatialScaper generator. We also perform distance scaling to prevent large distance errors from contributing more towards the loss function. Finally, we evaluate our approach on the evaluation subset of the STARSS23 dataset.
Comment: Technical report for DCASE 2024 Challenge Task 3
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2407.09021
Accession Number: edsarx.2407.09021
Database: arXiv
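
Illustrative note: the description above mentions Squeeze-and-Excitation blocks as a form of channel-wise attention. The following is a minimal NumPy sketch of the generic channel-wise squeeze-and-excitation operation, not the report's implementation. The function name `squeeze_excite`, the weight names `w1`, `b1`, `w2`, `b2`, the (channels, time, frequency) layout, and the reduction ratio `r` are all assumptions made for illustration; the spatial-wise variant mentioned in the description is not covered here.

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Generic channel-wise squeeze-and-excitation (illustrative sketch).

    x  : feature map of shape (C, T, F)  -- assumed layout
    w1 : bottleneck weights of shape (C // r, C)
    w2 : expansion weights of shape (C, C // r)
    Returns the rescaled feature map and the per-channel gates.
    """
    s = x.mean(axis=(1, 2))                      # squeeze: one statistic per channel, shape (C,)
    h = np.maximum(w1 @ s + b1, 0.0)             # bottleneck layer with ReLU, shape (C // r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))     # sigmoid gates in (0, 1), shape (C,)
    return x * g[:, None, None], g               # excite: scale every channel by its gate

# Hypothetical sizes chosen only for the demonstration.
C, T, F, r = 8, 10, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, T, F))
w1 = 0.1 * rng.standard_normal((C // r, C))
b1 = np.zeros(C // r)
w2 = 0.1 * rng.standard_normal((C, C // r))
b2 = np.zeros(C)

y, g = squeeze_excite(x, w1, b1, w2, b2)
print(y.shape, g.shape)
```

In a trained network the weights are learned, so the gates let the model emphasize informative channels and suppress others; here they are random and serve only to show the data flow.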