Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation

Large Language Models (LLMs) are increasingly multimodal, and Zero-Shot Visual Question Answering (VQA) shows promise for image interpretation. If zero-shot VQA can be applied to the 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in medicine, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLMs tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for improved control over image hallucination and indicate that the percentage of correct answers on multiple-choice questions may not be sufficient for assessing performance on VQA tasks.


Bibliographic Details
Main Authors: Tomohisa Seki, Yoshimasa Kawazoe, Hiromasa Ito, Yu Akagi, Toru Takiguchi, Kazuhiko Ohe
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-02-01
Series: Frontiers in Cardiovascular Medicine
Subjects:
Online Access: https://www.frontiersin.org/articles/10.3389/fcvm.2025.1458289/full
_version_ 1832087104894533632
author Tomohisa Seki
Yoshimasa Kawazoe
Hiromasa Ito
Yu Akagi
Toru Takiguchi
Kazuhiko Ohe
author_facet Tomohisa Seki
Yoshimasa Kawazoe
Hiromasa Ito
Yu Akagi
Toru Takiguchi
Kazuhiko Ohe
author_sort Tomohisa Seki
collection DOAJ
description Large Language Models (LLMs) are increasingly multimodal, and Zero-Shot Visual Question Answering (VQA) shows promise for image interpretation. If zero-shot VQA can be applied to the 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in medicine, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLMs tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for improved control over image hallucination and indicate that the percentage of correct answers on multiple-choice questions may not be sufficient for assessing performance on VQA tasks.
format Article
id doaj-art-2ca6d8c71bf1416082415ef0050aed82
institution Kabale University
issn 2297-055X
language English
publishDate 2025-02-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Cardiovascular Medicine
spelling doaj-art-2ca6d8c71bf1416082415ef0050aed82 2025-02-06T07:09:27Z
eng | Frontiers Media S.A. | Frontiers in Cardiovascular Medicine | 2297-055X | 2025-02-01 | vol. 12 | 10.3389/fcvm.2025.1458289 | 1458289
Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
Tomohisa Seki (0), Yoshimasa Kawazoe (1, 2), Hiromasa Ito (3), Yu Akagi (4), Toru Takiguchi (5), Kazuhiko Ohe (6, 7)
0, 1, 3, 5, 6: Department of Healthcare Information Management, The University of Tokyo Hospital, Tokyo, Japan
2: Artificial Intelligence and Digital Twin in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
4, 7: Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
Large Language Models (LLMs) are increasingly multimodal, and Zero-Shot Visual Question Answering (VQA) shows promise for image interpretation. If zero-shot VQA can be applied to the 12-lead electrocardiogram (ECG), a prevalent diagnostic tool in medicine, the potential benefits to the field would be substantial. This study evaluated the diagnostic performance of zero-shot VQA with multimodal LLMs on 12-lead ECG images. The results revealed that multimodal LLMs tended to make more errors in extracting and verbalizing image features than in describing preconditions and making logical inferences. Even when the answers were correct, erroneous descriptions of image features were common. These findings suggest a need for improved control over image hallucination and indicate that the percentage of correct answers on multiple-choice questions may not be sufficient for assessing performance on VQA tasks.
https://www.frontiersin.org/articles/10.3389/fcvm.2025.1458289/full
Keywords: large language model; electrocardiography; visual question answering; hallucination; zero-shot learning
spellingShingle Tomohisa Seki
Yoshimasa Kawazoe
Hiromasa Ito
Yu Akagi
Toru Takiguchi
Kazuhiko Ohe
Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
Frontiers in Cardiovascular Medicine
large language model
electrocardiography
visual question answering
hallucination
zero-shot learning
title Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
title_full Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
title_fullStr Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
title_full_unstemmed Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
title_short Assessing the performance of zero-shot visual question answering in multimodal large language models for 12-lead ECG image interpretation
title_sort assessing the performance of zero shot visual question answering in multimodal large language models for 12 lead ecg image interpretation
topic large language model
electrocardiography
visual question answering
hallucination
zero-shot learning
url https://www.frontiersin.org/articles/10.3389/fcvm.2025.1458289/full
work_keys_str_mv AT tomohisaseki assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation
AT yoshimasakawazoe assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation
AT hiromasaito assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation
AT yuakagi assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation
AT torutakiguchi assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation
AT kazuhikoohe assessingtheperformanceofzeroshotvisualquestionansweringinmultimodallargelanguagemodelsfor12leadecgimageinterpretation