Hybrid lightweight temporal-frequency analysis network for multi-channel speech enhancement
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-05-01 |
| Series: | EURASIP Journal on Audio, Speech, and Music Processing |
| Online Access: | https://doi.org/10.1186/s13636-025-00408-3 |
| Summary: | Speech signals captured by microphone arrays are often contaminated by noise and spatial reverberation, highlighting the importance of multi-channel speech enhancement (MCSE) in microphone array signal processing. In recent years, deep learning has led to significant advancements in MCSE tasks. Several models have been developed to effectively mitigate noise and reverberation, enhancing the quality and intelligibility of speech. However, they rely heavily on multi-head self-attention mechanisms to capture long-term features, resulting in high model complexity, which hinders their practical deployment. Lightweight models, which utilize numerous convolution layers for feature extraction, substantially reduce model complexity but tend to focus too much on local features, often achieving suboptimal performance. To address these limitations, we propose the Hybrid Lightweight Time-Frequency Analysis Network (HLTFA), which achieves an optimal trade-off between computational efficiency and feature extraction capabilities. The Lightweight Attentive Fourier Module (LAFM) is proposed as HLTFA’s encoder. It employs frequency-domain Fourier modules and time-domain lightweight attention modules to efficiently extract axial global features. Additionally, we propose a plug-and-play lightweight component, the High-Low Energy Module (HLEM), which captures spectral information from high and low energy regions, complementing the axial feature representation. Experimental results demonstrate that HLTFA outperforms the latest models while maintaining lower complexity. |
| ISSN: | 1687-4722 |