EMVAS: end-to-end multimodal emotion visualization analysis system
Abstract: Accurately interpreting human emotions is crucial for enhancing human–machine interactions in applications such as driver monitoring, adaptive learning, and smart environments. Conventional unimodal systems fail to capture the complex interplay of emotional cues in dynamic settings. To address these limitations, we propose EMVAS, an end-to-end multimodal emotion visualization analysis system that seamlessly integrates visual, auditory, and textual modalities. The preprocessing architecture utilizes silence-based audio segmentation alongside end-to-end DeepSpeech2 audio-to-text conversion to generate a synchronized and semantically consistent data stream. For feature extraction, facial landmark detection and action unit analysis capture fine-grained visual cues; Mel-frequency cepstral coefficients, log-scaled fundamental frequency, and Constant-Q transform extract detailed audio features; and a Transformer-based encoder processes textual data for contextual emotion analysis. These heterogeneous features are projected into a unified latent space and fused using a self-supervised multitask learning framework that leverages both shared and modality-specific representations to achieve robust emotion classification. An intuitive front-end provides real-time visualization of temporal trends and emotion frequency distributions. Extensive experiments on benchmark datasets and real-world scenarios demonstrate that EMVAS outperforms state-of-the-art baselines, achieving higher classification accuracy, improved F1 scores, lower mean absolute error, and stronger correlations.
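The abstract's audio branch combines silence-based segmentation with MFCC, log-F0, and Constant-Q features. The following is a minimal sketch of that recipe using librosa; it is not the authors' code (which this record does not include), and the sample rate, silence threshold, pitch range, and CQT bin count are assumed values chosen for illustration:

```python
# Illustrative sketch only: EMVAS's own implementation is not published in
# this record. SR, top_db, the 60-400 Hz pitch range, and n_bins=84 are
# assumptions, not the paper's settings.
import numpy as np
import librosa

SR = 16000   # assumed sample rate
HOP = 512    # one hop length shared by all three extractors, so frames align

def segment_on_silence(y, top_db=30):
    """Silence-based segmentation: (start, end) sample spans of non-silent audio."""
    return librosa.effects.split(y, top_db=top_db, hop_length=HOP)

def audio_features(y, sr=SR):
    """Frame-aligned MFCC, log-F0, and Constant-Q features for one segment."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=HOP)  # (13, T)
    f0 = librosa.yin(y, fmin=60.0, fmax=400.0, sr=sr, hop_length=HOP)   # (T,)
    log_f0 = np.log(f0)[np.newaxis, :]                                  # (1, T)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=HOP, n_bins=84))      # (84, T)
    t = min(mfcc.shape[1], log_f0.shape[1], cqt.shape[1])               # trim off-by-one
    return np.concatenate([mfcc[:, :t], log_f0[:, :t], cqt[:, :t]])    # (98, T)

y, _ = librosa.load("utterance.wav", sr=SR)  # hypothetical input file
segments = [audio_features(y[s:e]) for s, e in segment_on_silence(y)
            if e - s > SR // 2]              # skip fragments too short for CQT/YIN
```

Sharing a single hop length across the three extractors is what allows the per-frame vectors to be stacked into one 98-dimensional feature stream per segment. A sketch of the subsequent latent-space fusion step follows the record table below.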
| Main Authors: | Xianxun Zhu, Heyang Feng, Erik Cambria, Yao Huang, Ming Ju, Haochen Yuan, Rui Wang |
|---|---|
| Affiliations: | School of Communication and Information Engineering, Shanghai University (Zhu, Feng, Huang, Ju, Wang); College of Computing and Data Science, Nanyang Technological University (Cambria); Faculty of Computing, Harbin Institute of Technology (Yuan) |
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Complex & Intelligent Systems |
| ISSN: | 2199-4536, 2198-6053 |
| Subjects: | End-to-end; Multimodal; Emotion analysis; Visualization system |
| Online Access: | https://doi.org/10.1007/s40747-025-01931-8 |
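The abstract also states that the visual, audio, and text features are projected into a unified latent space before fusion and classification. Below is a minimal PyTorch sketch of that projection-and-fuse idea; every dimension (136 for 68 facial landmarks × 2 coordinates, 98 from the audio sketch above, 768 for a Transformer text embedding) and the plain concatenation head are assumptions for illustration, not the paper's self-supervised multitask architecture:

```python
# Hedged sketch of "project each modality into a shared latent space, then
# fuse and classify". All sizes and the fusion head are assumed, not EMVAS's.
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    def __init__(self, d_vis=136, d_aud=98, d_txt=768, d_latent=128, n_classes=7):
        super().__init__()
        # One projector per modality maps its native feature size to d_latent.
        self.proj = nn.ModuleDict({
            "vis": nn.Linear(d_vis, d_latent),
            "aud": nn.Linear(d_aud, d_latent),
            "txt": nn.Linear(d_txt, d_latent),
        })
        self.head = nn.Sequential(
            nn.Linear(3 * d_latent, d_latent), nn.ReLU(),
            nn.Linear(d_latent, n_classes),
        )

    def forward(self, vis, aud, txt):
        # Project each modality into the shared space, then fuse by concatenation.
        z = [torch.tanh(self.proj[k](x))
             for k, x in (("vis", vis), ("aud", aud), ("txt", txt))]
        return self.head(torch.cat(z, dim=-1))   # emotion logits

logits = LatentFusion()(torch.randn(4, 136), torch.randn(4, 98), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])
```

The paper's framework additionally learns shared and modality-specific representations under self-supervised multitask objectives; this sketch shows only the shared-latent-space projection that both approaches have in common.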