End-to-End Multi-Modal Speaker Change Detection with Pre-Trained Models

Bibliographic Details
Main Authors: Alymzhan Toleu, Gulmira Tolegen, Alexandr Pak, Assel Jaxylykova, Bagashar Zhumazhanov
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/8/4324
Description
Summary: In this work, we propose a multi-modal speaker change detection (SCD) approach with focal loss, which integrates audio and text features to enhance detection performance. The proposed approach uses pre-trained large-scale models for feature extraction and incorporates a self-attention mechanism to emphasize features relevant to speaker changes. The extracted features are fused and processed through a fully connected classification network, with layer normalization and dropout for stability and generalization. To address class imbalance, we apply focal loss, which down-weights well-classified examples and focuses training on difficult samples, leading to better-balanced performance. Extensive experiments on a multi-talker meeting dataset demonstrate that the proposed multi-modal approach consistently outperforms single-modal models, confirming the complementary nature of audio and text for SCD. Fine-tuning the pre-trained audio and text models (Wav2Vec2 and BERT) significantly boosts accuracy, achieving a 21% improvement over frozen models. The self-attention mechanism further improves performance by 2%, highlighting its ability to capture speaker transition cues. Additionally, focal loss enhances the model's robustness to imbalanced data.
ISSN: 2076-3417
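
To make the pipeline the summary describes more concrete, the following PyTorch sketch shows one plausible arrangement: pre-trained Wav2Vec2 and BERT encoders, self-attention over the fused features, a fully connected head with layer normalization and dropout, and focal loss. The checkpoint names, pooling choices, layer sizes, and the exact placement of self-attention are illustrative assumptions, not the authors' released code.

# A minimal sketch under the assumptions stated above; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, Wav2Vec2Model


class FocalLoss(nn.Module):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    Down-weights well-classified examples so training concentrates on the
    rare speaker-change frames. alpha and gamma are common defaults here,
    not the paper's reported settings.
    """

    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma

    def forward(self, logits, targets):
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()


class MultiModalSCD(nn.Module):
    """Wav2Vec2 audio features + BERT text features -> self-attention -> FC head."""

    def __init__(self, hidden: int = 256, dropout: float = 0.3):
        super().__init__()
        self.audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_enc = BertModel.from_pretrained("bert-base-uncased")
        d = self.text_enc.config.hidden_size  # 768 for both base encoders
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
        self.head = nn.Sequential(  # fully connected classifier with LayerNorm + dropout
            nn.LayerNorm(d),
            nn.Linear(d, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),  # one logit: speaker change vs. no change
        )

    def forward(self, waveform, input_ids, attention_mask):
        # Frame-level audio features and token-level text features; padding
        # masks are omitted inside the attention for brevity.
        a = self.audio_enc(waveform).last_hidden_state
        t = self.text_enc(input_ids, attention_mask=attention_mask).last_hidden_state
        fused = torch.cat([a, t], dim=1)  # concatenate along the sequence axis
        attended, _ = self.attn(fused, fused, fused)  # self-attention across modalities
        return self.head(attended.mean(dim=1)).squeeze(-1)  # (batch,) logits

A training step would then be loss = FocalLoss()(model(wave, ids, mask), labels.float()). Freezing both encoders with p.requires_grad_(False) would correspond to the "frozen" baseline the summary compares against, while leaving them trainable fine-tunes the encoders end to end.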