Multi-Task Supervised Alignment Pre-Training for Few-Shot Multimodal Sentiment Analysis
Few-shot multimodal sentiment analysis (FMSA) has garnered substantial attention due to the proliferation of multimedia applications, especially given the frequent difficulty in obtaining large quantities of training samples. Previous works have directly incorporated the vision modality into a pre-trained language model (PLM) and then leveraged prompt learning, showing effectiveness in few-shot scenarios. However, these methods encounter challenges in aligning the high-level semantics of different modalities due to their inherent heterogeneity, which impacts the performance of sentiment analysis. In this paper, we propose a novel framework called Multi-task Supervised Alignment Pre-training (MSAP) to enhance modality alignment and consequently improve the performance of multimodal sentiment analysis. Our approach uses a multi-task training method, incorporating image classification, image style recognition, and image captioning, to extract modal-shared information and stronger semantics to improve visual representation. We employ task-specific prompts to unify these diverse objectives into a single Masked Language Model (MLM), which serves as the foundation of the MSAP framework and enhances the alignment of the visual and textual modalities. Extensive experiments on three datasets demonstrate that our method achieves a new state-of-the-art for the FMSA task.
| Main Authors: | Junyang Yang, Jiuxin Cao, Chengge Duan |
|---|---|
| Affiliation: | School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | Applied Sciences |
| ISSN: | 2076-3417 |
| DOI: | 10.3390/app15042095 |
| Subjects: | few-shot multimodal sentiment analysis; multi-task learning; supervised pre-training |
| Online Access: | https://www.mdpi.com/2076-3417/15/4/2095 |
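To make the prompt-unification idea from the abstract concrete, below is a minimal, hypothetical sketch of how three supervision tasks can be cast as a single masked-language-model objective via task-specific prompts. The prompt templates, the `<image>` placeholder, and the label words are illustrative assumptions, not the paper's actual prompts or code; in the described framework, visual features would presumably be injected as prefix embeddings from a vision encoder rather than as a text placeholder.

```python
# Sketch: unifying image classification, style recognition, and (word-level)
# captioning as masked-token prediction with task-specific prompts.
# Hypothetical illustration; not the authors' implementation.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One prompt template per task; [MASK] is where the label word goes.
# "<image>" stands in for the visual prefix the real framework would inject.
PROMPTS = {
    "classification": "<image> This picture shows a [MASK].",
    "style":          "<image> The style of this picture is [MASK].",
    "caption":        "<image> A photo of a [MASK].",
}

def mlm_loss(task: str, label_word: str) -> torch.Tensor:
    """MLM loss for one (task, label) pair; assumes the label word is a
    single token in the vocabulary so prompt and target align position-wise."""
    prompt = PROMPTS[task]
    target = prompt.replace("[MASK]", label_word)
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt")["input_ids"]
    # Supervise only the masked position; -100 is ignored by the loss.
    labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100
    return model(**inputs, labels=labels).loss

# One multi-task step: sum the per-task losses and backpropagate jointly,
# so all three objectives update the same MLM.
loss = (mlm_loss("classification", "dog")
        + mlm_loss("style", "sketch")
        + mlm_loss("caption", "dog"))
loss.backward()
```

Note that captioning is reduced here to predicting a single salient word purely for brevity; the point of the sketch is only that distinct supervision signals can share one MLM head once each task is phrased as a fill-in-the-mask prompt.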