Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
Large language models (LLMs) are increasingly multimodal, and zero-shot visual question answering (VQA) shows promise for image interpretation. If zero-shot VQA could be applied to the 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in medicine, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLMs tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for better control of image hallucination and indicate that accuracy on multiple-choice questions alone may not be sufficient for assessing performance on VQA tasks.
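The record itself contains no code, but the zero-shot setup it describes is easy to sketch. Below is a minimal, illustrative VQA call, assuming OpenAI's Python SDK (v1+) and a vision-capable model; the model name, image path, question wording, and answer options are placeholders, not the study's actual protocol.

```python
# Minimal zero-shot VQA sketch: send a 12-lead ECG image and a
# multiple-choice question to a vision-capable LLM in a single turn,
# with no exemplars or fine-tuning (the zero-shot setting).
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY env var.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode an ECG image so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ecg_vqa(image_path: str, question: str) -> str:
    """Ask one question about one ECG image and return the model's answer."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        temperature=0,  # reduce sampling variance for evaluation runs
    )
    return response.choices[0].message.content

# Illustrative question; the paper's actual prompts are not reproduced here.
question = (
    "This is a 12-lead ECG image. Which diagnosis fits best?\n"
    "(a) normal sinus rhythm (b) atrial fibrillation "
    "(c) acute myocardial infarction (d) complete AV block\n"
    "Answer with one option and describe the image features that support it."
)
print(ecg_vqa("ecg_example.png", question))
```

Asking the model to describe its supporting image features, not just pick an option, reflects the paper's central point: a correct multiple-choice answer can coexist with hallucinated feature descriptions, so the free-text rationale also needs to be scored.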
Saved in:
Main Authors: | Tomohisa Seki, Yoshimasa Kawazoe, Hiromasa Ito, Yu Akagi, Toru Takiguchi, Kazuhiko Ohe |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-02-01 |
Series: | Frontiers in Cardiovascular Medicine |
Subjects: | large language model; electrocardiography; visual question answering; hallucination; zero-shot learning |
Online Access: | https://www.frontiersin.org/articles/10.3389/fcvm.2025.1458289/full |
author | Tomohisa Seki, Yoshimasa Kawazoe, Hiromasa Ito, Yu Akagi, Toru Takiguchi, Kazuhiko Ohe |
collection | DOAJ |
description | Large language models (LLMs) are increasingly multimodal, and zero-shot visual question answering (VQA) shows promise for image interpretation. If zero-shot VQA could be applied to the 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in medicine, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLMs tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for better control of image hallucination and indicate that accuracy on multiple-choice questions alone may not be sufficient for assessing performance on VQA tasks. |
format | Article |
id | doaj-art-2ca6d8c71bf1416082415ef0050aed82 |
institution | Kabale University |
issn | 2297-055X |
language | English |
publishDate | 2025-02-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Cardiovascular Medicine |
spelling | doaj-art-2ca6d8c71bf1416082415ef0050aed82; indexed 2025-02-06T07:09:27Z; eng; Frontiers Media S.A.; Frontiers in Cardiovascular Medicine; ISSN 2297-055X; 2025-02-01; vol. 12; article 1458289; doi: 10.3389/fcvm.2025.1458289 |
affiliations | Tomohisa Seki, Hiromasa Ito, Toru Takiguchi, Kazuhiko Ohe: Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan. Yoshimasa Kawazoe: Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan, and Artificial Intelligence and Digital Twin in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Yu Akagi, Kazuhiko Ohe: Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. |
title | Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation |
topic | large language model; electrocardiography; visual question answering; hallucination; zero-shot learning |
url | https://www.frontiersin.org/articles/10.3389/fcvm.2025.1458289/full |