EMVAS: end-to-end multimodal emotion visualization analysis system
Abstract: Accurately interpreting human emotions is crucial for enhancing human–machine interactions in applications such as driver monitoring, adaptive learning, and smart environments. Conventional unimodal systems fail to capture the complex interplay of emotional cues in dynamic settings. To address these limitations, we propose EMVAS, an end-to-end multimodal emotion visualization analysis system that seamlessly integrates visual, auditory, and textual modalities. The preprocessing architecture utilizes silence-based audio segmentation alongside end-to-end DeepSpeech2 audio-to-text conversion to generate a synchronized and semantically consistent data stream. For feature extraction, facial landmark detection and action unit analysis capture fine-grained visual cues; Mel-frequency cepstral coefficients, log-scaled fundamental frequency, and Constant-Q transform extract detailed audio features; and a Transformer-based encoder processes textual data for contextual emotion analysis. These heterogeneous features are projected into a unified latent space and fused using a self-supervised multitask learning framework that leverages both shared and modality-specific representations to achieve robust emotion classification. An intuitive front-end provides real-time visualization of temporal trends and emotion frequency distributions. Extensive experiments on benchmark datasets and real-world scenarios demonstrate that EMVAS outperforms state-of-the-art baselines, achieving higher classification accuracy, improved F1 scores, lower mean absolute error, and stronger correlations.
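The abstract's audio branch combines silence-based segmentation with MFCC, log-F0, and Constant-Q features. The following is a minimal sketch of that recipe using librosa; it is not the authors' code (which this record does not include), and the sample rate, silence threshold, pitch range, and CQT bin count are assumed values chosen for illustration:

```python
# Illustrative sketch only: EMVAS's own implementation is not published in
# this record. SR, top_db, the 60-400 Hz pitch range, and n_bins=84 are
# assumptions, not the paper's settings.
import numpy as np
import librosa

SR = 16000   # assumed sample rate
HOP = 512    # one hop length shared by all three extractors, so frames align

def segment_on_silence(y, top_db=30):
    """Silence-based segmentation: (start, end) sample spans of non-silent audio."""
    return librosa.effects.split(y, top_db=top_db, hop_length=HOP)

def audio_features(y, sr=SR):
    """Frame-aligned MFCC, log-F0, and Constant-Q features for one segment."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=HOP)  # (13, T)
    f0 = librosa.yin(y, fmin=60.0, fmax=400.0, sr=sr, hop_length=HOP)   # (T,)
    log_f0 = np.log(f0)[np.newaxis, :]                                  # (1, T)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=HOP, n_bins=84))      # (84, T)
    t = min(mfcc.shape[1], log_f0.shape[1], cqt.shape[1])               # trim off-by-one
    return np.concatenate([mfcc[:, :t], log_f0[:, :t], cqt[:, :t]])    # (98, T)

y, _ = librosa.load("utterance.wav", sr=SR)  # hypothetical input file
segments = [audio_features(y[s:e]) for s, e in segment_on_silence(y)
            if e - s > SR // 2]              # skip fragments too short for CQT/YIN
```

Sharing a single hop length across the three extractors is what allows the per-frame vectors to be stacked into one 98-dimensional feature stream per segment. A sketch of the subsequent latent-space fusion step follows the record table below.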
| Main Authors: | Xianxun Zhu, Heyang Feng, Erik Cambria, Yao Huang, Ming Ju, Haochen Yuan, Rui Wang |
|---|---|
| Affiliations: | School of Communication and Information Engineering, Shanghai University (Zhu, Feng, Huang, Ju, Wang); College of Computing and Data Science, Nanyang Technological University (Cambria); Faculty of Computing, Harbin Institute of Technology (Yuan) |
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-07-01 |
| Series: | Complex & Intelligent Systems |
| ISSN: | 2199-4536, 2198-6053 |
| Subjects: | End-to-end; Multimodal; Emotion analysis; Visualization system |
| Online Access: | https://doi.org/10.1007/s40747-025-01931-8 |
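The abstract also states that the visual, audio, and text features are projected into a unified latent space before fusion and classification. Below is a minimal PyTorch sketch of that projection-and-fuse idea; every dimension (136 for 68 facial landmarks × 2 coordinates, 98 from the audio sketch above, 768 for a Transformer text embedding) and the plain concatenation head are assumptions for illustration, not the paper's self-supervised multitask architecture:

```python
# Hedged sketch of "project each modality into a shared latent space, then
# fuse and classify". All sizes and the fusion head are assumed, not EMVAS's.
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    def __init__(self, d_vis=136, d_aud=98, d_txt=768, d_latent=128, n_classes=7):
        super().__init__()
        # One projector per modality maps its native feature size to d_latent.
        self.proj = nn.ModuleDict({
            "vis": nn.Linear(d_vis, d_latent),
            "aud": nn.Linear(d_aud, d_latent),
            "txt": nn.Linear(d_txt, d_latent),
        })
        self.head = nn.Sequential(
            nn.Linear(3 * d_latent, d_latent), nn.ReLU(),
            nn.Linear(d_latent, n_classes),
        )

    def forward(self, vis, aud, txt):
        # Project each modality into the shared space, then fuse by concatenation.
        z = [torch.tanh(self.proj[k](x))
             for k, x in (("vis", vis), ("aud", aud), ("txt", txt))]
        return self.head(torch.cat(z, dim=-1))   # emotion logits

logits = LatentFusion()(torch.randn(4, 136), torch.randn(4, 98), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])
```

The paper's framework additionally learns shared and modality-specific representations under self-supervised multitask objectives; this sketch shows only the shared-latent-space projection that both approaches have in common.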