A multimodal visual–language foundation model for computational ophthalmology

Abstract Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images from 11 modalities with partial clinical text. Our novel pretraining strategy combines self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance across 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world, long-tail scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases, and bridging gaps in real-world clinical applications.
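The pretraining strategy described above combines three objectives. As a rough illustration, the combination can be sketched as a weighted sum of a reconstruction term and two CLIP-style InfoNCE contrastive terms; this is a minimal NumPy sketch under that assumption, and all function names, loss forms, and weights below are illustrative, not taken from the paper.

```python
# Hypothetical sketch of an EyeCLIP-style combined pretraining objective.
# The paper names three components (self-supervised reconstruction,
# multimodal image contrastive learning, image-text contrastive learning);
# the concrete loss forms, weights, and names here are assumptions.
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` should match row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) cosine similarities
    idx = np.arange(logits.shape[0])

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()           # diagonal entries are positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def eyeclip_style_loss(recon, target, view_a, view_b, img_emb, txt_emb,
                       weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three pretraining objectives (weights assumed)."""
    l_recon = ((recon - target) ** 2).mean()         # self-supervised reconstruction
    l_img_img = info_nce(view_a, view_b)             # contrast paired image modalities
    l_img_txt = info_nce(img_emb, txt_emb)           # align images with clinical text
    w_r, w_ii, w_it = weights
    return w_r * l_recon + w_ii * l_img_img + w_it * l_img_txt
```

In practice each term would operate on encoder outputs for different modality views of the same patient; the sketch only shows how the three losses compose into one objective.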

Bibliographic Details
Main Authors: Danli Shi, Weiyi Zhang, Jiancheng Yang, Siyu Huang, Xiaolan Chen, Pusheng Xu, Kai Jin, Shan Lin, Jin Wei, Mayinuer Yusufu, Shunming Liu, Qing Zhang, Zongyuan Ge, Xun Xu, Mingguang He
Format: Article
Language: English
Published: Nature Portfolio 2025-06-01
Series: npj Digital Medicine
Online Access: https://doi.org/10.1038/s41746-025-01772-2
Collection: DOAJ
Record ID: doaj-art-75a4ebd0cc1b49aba8b00a813d80f12c
Institution: Kabale University
ISSN: 2398-6352
Citation: npj Digital Medicine, vol. 8, no. 1, article 113 (2025-06-01). https://doi.org/10.1038/s41746-025-01772-2

Author affiliations:
Danli Shi: School of Optometry, The Hong Kong Polytechnic University
Weiyi Zhang: School of Optometry, The Hong Kong Polytechnic University
Jiancheng Yang: Swiss Federal Institute of Technology Lausanne (EPFL)
Siyu Huang: School of Computing, Clemson University
Xiaolan Chen: School of Optometry, The Hong Kong Polytechnic University
Pusheng Xu: School of Optometry, The Hong Kong Polytechnic University
Kai Jin: Department of Ophthalmology, The Second Affiliated Hospital, School of Medicine, Zhejiang University
Shan Lin: Wuhan Bright Eye Hospital
Jin Wei: Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases
Mayinuer Yusufu: Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital
Shunming Liu: Department of Ophthalmology, Guangdong Academy of Medical Sciences, Guangdong Provincial People’s Hospital
Qing Zhang: Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University
Zongyuan Ge: AIM for Health Lab, Faculty of Information Technology, Monash University
Xun Xu: Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases
Mingguang He: School of Optometry, The Hong Kong Polytechnic University