End-to-End Multi-Modal Speaker Change Detection with Pre-Trained Models
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/8/4324 |
| Summary: | In this work, we propose a multi-modal speaker change detection (SCD) approach with focal loss, which integrates audio and text features to improve detection performance. The proposed approach uses pre-trained large-scale models for feature extraction and a self-attention mechanism to emphasize features relevant to speaker change. The extracted features are fused and processed by a fully connected classification network, with layer normalization and dropout for stability and generalization. To address class imbalance, we apply focal loss, which down-weights easy examples and focuses training on difficult ones, yielding better-balanced performance. Extensive experiments on a multi-talker meeting dataset show that the proposed multi-modal approach consistently outperforms single-modal models, demonstrating the complementary nature of audio and text for SCD. Fine-tuning the pre-trained audio and text models (Wav2Vec2 and BERT) significantly boosts accuracy, achieving a 21% improvement over frozen models. The self-attention mechanism further improves performance by 2%, highlighting its ability to capture speaker-transition cues. Focal loss additionally makes the model more robust to imbalanced data. (Illustrative sketches of the fusion network and the focal loss follow this record.) |
| ISSN: | 2076-3417 |
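The summary names the building blocks (Wav2Vec2 and BERT encoders, self-attention, a fully connected head with layer normalization and dropout) but not their wiring. The PyTorch sketch below shows one plausible arrangement; the checkpoint names (`facebook/wav2vec2-base`, `bert-base-uncased`), concatenation-based fusion, mean pooling, head sizes, and number of attention heads are all illustrative assumptions, not the authors' configuration.

```python
# A minimal sketch of the described pipeline, under the assumptions stated above.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class MultiModalSCD(nn.Module):
    def __init__(self, hidden: int = 768, dropout: float = 0.3):
        super().__init__()
        # Pre-trained large-scale encoders; fine-tune or freeze their weights.
        self.audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_enc = BertModel.from_pretrained("bert-base-uncased")
        # Self-attention over the joint audio+text token sequence, intended to
        # emphasize features relevant to speaker change.
        self.attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=8,
                                          batch_first=True)
        # Fully connected classifier with layer normalization and dropout.
        self.head = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1),  # one logit: speaker change vs. no change
        )

    def forward(self, waveform, input_ids, attention_mask):
        # Token-level features from each encoder (both 768-dim for the base
        # checkpoints), fused by concatenation along the time axis.
        a = self.audio_enc(waveform).last_hidden_state            # (B, Ta, 768)
        t = self.text_enc(input_ids,
                          attention_mask=attention_mask).last_hidden_state
        fused = torch.cat([a, t], dim=1)                          # (B, Ta+Tt, 768)
        attended, _ = self.attn(fused, fused, fused)
        # Mean-pool the attended sequence (pooling choice is an assumption).
        return self.head(attended.mean(dim=1)).squeeze(-1)        # (B,) logits
```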
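Focal loss is the one component the summary describes precisely enough to reproduce. A minimal binary version follows; the `alpha` and `gamma` defaults are the common values from Lin et al.'s original formulation, assumed here rather than taken from the paper.

```python
# Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # Per-example cross-entropy, kept unreduced so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the model's probability of the true class.
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t balances the rare speaker-change class against the majority class.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples so training
    # focuses on the difficult ones.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

With the model sketched above, training then reduces to `loss = focal_loss(model(wave, ids, mask), labels.float())` for a 0/1 speaker-change label per example.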