EMVAS: end-to-end multimodal emotion visualization analysis system

Bibliographic Details
Main Authors: Xianxun Zhu, Heyang Feng, Erik Cambria, Yao Huang, Ming Ju, Haochen Yuan, Rui Wang
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Complex & Intelligent Systems
Subjects: End-to-end; Multimodal; Emotion analysis; Visualization system
Online Access: https://doi.org/10.1007/s40747-025-01931-8
author Xianxun Zhu
Heyang Feng
Erik Cambria
Yao Huang
Ming Ju
Haochen Yuan
Rui Wang
collection DOAJ
description Abstract Accurately interpreting human emotions is crucial for enhancing human–machine interactions in applications such as driver monitoring, adaptive learning, and smart environments. Conventional unimodal systems fail to capture the complex interplay of emotional cues in dynamic settings. To address these limitations, we propose EMVAS, an end-to-end multimodal emotion visualization analysis system that seamlessly integrates visual, auditory, and textual modalities. The preprocessing architecture uses silence-based audio segmentation alongside end-to-end DeepSpeech2 audio-to-text conversion to generate a synchronized and semantically consistent data stream. For feature extraction, facial landmark detection and action unit analysis capture fine-grained visual cues; Mel-frequency cepstral coefficients, log-scaled fundamental frequency, and the constant-Q transform provide detailed audio features; and a Transformer-based encoder processes textual data for contextual emotion analysis. These heterogeneous features are projected into a unified latent space and fused within a self-supervised multitask learning framework that leverages both shared and modality-specific representations for robust emotion classification. An intuitive front end provides real-time visualization of temporal trends and emotion frequency distributions. Extensive experiments on benchmark datasets and real-world scenarios demonstrate that EMVAS outperforms state-of-the-art baselines, achieving higher classification accuracy, improved F1 scores, lower mean absolute error, and stronger correlations.
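The abstract describes the fusion step only at a high level (heterogeneous features projected into a unified latent space, with shared and modality-specific representations used for classification). The sketch below is a minimal illustration of that general idea, assuming PyTorch; the class name, feature dimensions, concatenation-based fusion, and auxiliary heads are assumptions for illustration only and do not reproduce the authors' architecture or their self-supervised multitask objectives.

# Hypothetical sketch (not the authors' released code): project each
# modality's feature vector into a shared latent space, fuse by
# concatenation, and predict the emotion class with a shared head plus
# per-modality auxiliary heads.
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, dims, latent_dim=128, num_emotions=7):
        super().__init__()
        # One linear projection per modality into the unified latent space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, latent_dim) for m, d in dims.items()})
        # Shared classifier over the fused (concatenated) latent vectors.
        self.shared_head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(latent_dim * len(dims), num_emotions),
        )
        # Modality-specific heads, loosely mirroring the idea of keeping
        # modality-specific representations alongside the shared one.
        self.aux_heads = nn.ModuleDict({m: nn.Linear(latent_dim, num_emotions) for m in dims})

    def forward(self, feats):
        # feats: dict mapping modality name -> tensor of shape (batch, dim)
        latents = {m: self.proj[m](x) for m, x in feats.items()}
        fused = torch.cat([latents[m] for m in sorted(latents)], dim=-1)
        return self.shared_head(fused), {m: self.aux_heads[m](z) for m, z in latents.items()}

# Toy usage with random inputs and assumed feature sizes.
model = FusionEmotionClassifier({"visual": 35, "audio": 74, "text": 768})
batch = {
    "visual": torch.randn(4, 35),   # e.g. facial landmark / action-unit features
    "audio": torch.randn(4, 74),    # e.g. MFCC + log-F0 + constant-Q statistics
    "text": torch.randn(4, 768),    # e.g. Transformer sentence embedding
}
shared_logits, aux_logits = model(batch)
print(shared_logits.shape)          # torch.Size([4, 7])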
format Article
id doaj-art-9d1e5047fd514f44a7e95de932f2a30c
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-07-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling Xianxun Zhu, Heyang Feng, Yao Huang, Ming Ju, Rui Wang: School of Communication and Information Engineering, Shanghai University; Erik Cambria: College of Computing and Data Science, Nanyang Technological University; Haochen Yuan: Faculty of Computing, Harbin Institute of Technology
title EMVAS: end-to-end multimodal emotion visualization analysis system
topic End-to-end
Multimodal
Emotion analysis
Visualization system
url https://doi.org/10.1007/s40747-025-01931-8