Hierarchical cross-modal attention and dual audio pathways for enhanced multimodal sentiment analysis

Bibliographic Details
Main Authors: D. Vamsidhar, Parth Desai, Aniket K. Shahade, Shruti Patil, Priyanka V. Deshmukh
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-025-09000-3
Description
Summary: Abstract This paper presents a new architecture for multimodal sentiment analysis that exploits hierarchical cross-modal attention mechanisms together with two parallel audio processing pathways. Traditional sentiment analysis approaches rely mainly on text data, which can be limiting because valuable sentiment information may also reside in images and audio. To address this issue, the model provides a unified framework that integrates three modalities (text, image, audio) using a BERT text encoder, a ResNet50 visual feature extractor, and a hybrid CNN-Wav2Vec2.0 pipeline for audio representation. Its main innovation is a dual audio pathway augmented with a dynamic gating module and a cross-modal self-attention layer that enables fine-grained interaction among modalities. Our model achieves state-of-the-art performance on several benchmarks, outperforming recent approaches such as CLIP, MISA, and MSFNet; in particular, the results show improved classification accuracy when modality data are missing or noisy. The system's robustness and reliability are validated through an exhaustive analysis using metrics such as precision, recall, F1-score, and confusion matrices. In addition, the architecture demonstrates modular scalability and adaptability across domains, making it well suited for applications in healthcare, social media, and customer service. By providing a framework for developing affective AI systems that can decode human emotion from intricate multimodal features, the study lays the groundwork for future research on processing such data streams, including real-time processing, domain-specific adaptation, and extension to multi-channel sensor input that combines physiological and temporal data streams.
ISSN: 2045-2322
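
As a rough illustration of the fusion strategy the abstract describes, the following Python (PyTorch) sketch blends two audio embeddings (e.g., from a CNN branch and a Wav2Vec2.0 branch) through a learned dynamic gate and then fuses the text, image, and audio embeddings with a self-attention layer before sentiment classification. All dimensions, layer choices, and the gating formulation here are assumptions made for illustration; they are not the authors' exact architecture.

# Hedged sketch of dual-audio gating plus cross-modal self-attention fusion.
# Layer sizes and the gating formulation are assumptions, not the paper's design.
import torch
import torch.nn as nn


class DualAudioGate(nn.Module):
    """Blend CNN-based and Wav2Vec2-based audio embeddings with a dynamic gate."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, a_cnn: torch.Tensor, a_w2v: torch.Tensor) -> torch.Tensor:
        # Per-feature mixing weights computed from both pathways.
        g = self.gate(torch.cat([a_cnn, a_w2v], dim=-1))
        return g * a_cnn + (1.0 - g) * a_w2v


class CrossModalFusion(nn.Module):
    """Let each modality embedding attend to all modalities, then classify."""

    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text: torch.Tensor, image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        tokens = torch.stack([text, image, audio], dim=1)   # (batch, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)        # self-attention across modalities
        fused = self.norm(fused + tokens)                   # residual connection
        return self.classifier(fused.mean(dim=1))           # pool modalities, predict sentiment


if __name__ == "__main__":
    B, D = 8, 256
    gate, fusion = DualAudioGate(D), CrossModalFusion(D)
    audio = gate(torch.randn(B, D), torch.randn(B, D))            # merged audio embedding
    logits = fusion(torch.randn(B, D), torch.randn(B, D), audio)  # text, image, audio
    print(logits.shape)  # torch.Size([8, 3])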