A multimodal visual–language foundation model for computational ophthalmology
Abstract Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images from 11 modalities with partial clinical text. Our novel pretraining strategy combines self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance across 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world, long-tail scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases, and bridging gaps in real-world clinical applications.
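The pretraining recipe summarized in the abstract combines three objectives: self-supervised reconstruction, image-image contrastive learning across modalities of the same patient, and image-text contrastive learning against clinical reports. A minimal PyTorch-style sketch of how such a combined loss could be wired is shown below; the module names, the `clip_loss` helper, the MSE reconstruction target, and the loss weights `w` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def clip_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (CLIP-style)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def pretrain_step(image_encoder, text_encoder, decoder, batch, w=(1.0, 1.0, 1.0)):
    # batch holds two imaging modalities from the same patient plus optional report text
    img_a, img_b, text_tokens = batch["modality_a"], batch["modality_b"], batch["text"]

    z_a = image_encoder(img_a)        # one shared image encoder across all modalities
    z_b = image_encoder(img_b)

    # 1) self-supervised reconstruction of the input image from its embedding
    loss_recon = F.mse_loss(decoder(z_a), img_a)

    # 2) multimodal image-image contrastive: views of the same patient attract
    loss_img = clip_loss(z_a, z_b)

    # 3) image-text contrastive, applied only where clinical text exists
    loss_txt = torch.zeros((), device=img_a.device)
    if text_tokens is not None:
        loss_txt = clip_loss(z_a, text_encoder(text_tokens))

    return w[0] * loss_recon + w[1] * loss_img + w[2] * loss_txt
```

Because the text term is optional, images without reports still contribute through the first two objectives, which is one way the "partial clinical text" in the 2.77 million-image corpus can be exploited. At inference, the zero-shot classification mentioned in the abstract amounts to scoring an image embedding against text embeddings of candidate disease descriptions, using the same similarity as in `clip_loss`.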
| Main Authors: | Danli Shi, Weiyi Zhang, Jiancheng Yang, Siyu Huang, Xiaolan Chen, Pusheng Xu, Kai Jin, Shan Lin, Jin Wei, Mayinuer Yusufu, Shunming Liu, Qing Zhang, Zongyuan Ge, Xun Xu, Mingguang He |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-06-01 |
| Series: | npj Digital Medicine |
| Online Access: | https://doi.org/10.1038/s41746-025-01772-2 |
| _version_ | 1849329396201029632 |
|---|---|
| author | Danli Shi; Weiyi Zhang; Jiancheng Yang; Siyu Huang; Xiaolan Chen; Pusheng Xu; Kai Jin; Shan Lin; Jin Wei; Mayinuer Yusufu; Shunming Liu; Qing Zhang; Zongyuan Ge; Xun Xu; Mingguang He |
| author_sort | Danli Shi |
| collection | DOAJ |
| description | Abstract Early detection of eye diseases is vital for preventing vision loss. Existing ophthalmic artificial intelligence models focus on single modalities, overlooking multi-view information and struggling with rare diseases due to long-tail distributions. We propose EyeCLIP, a multimodal visual-language foundation model trained on 2.77 million ophthalmology images from 11 modalities with partial clinical text. Our novel pretraining strategy combines self-supervised reconstruction, multimodal image contrastive learning, and image-text contrastive learning to capture shared representations across modalities. EyeCLIP demonstrates robust performance across 14 benchmark datasets, excelling in disease classification, visual question answering, and cross-modal retrieval. It also exhibits strong few-shot and zero-shot capabilities, enabling accurate predictions in real-world, long-tail scenarios. EyeCLIP offers significant potential for detecting both ocular and systemic diseases, and bridging gaps in real-world clinical applications. |
| format | Article |
| id | doaj-art-75a4ebd0cc1b49aba8b00a813d80f12c |
| institution | Kabale University |
| issn | 2398-6352 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | npj Digital Medicine |
| spelling | A multimodal visual–language foundation model for computational ophthalmology. npj Digital Medicine 8(1):1-13, Nature Portfolio, 2025-06-01, ISSN 2398-6352, doi:10.1038/s41746-025-01772-2. Author affiliations: Danli Shi, Weiyi Zhang, Xiaolan Chen, Pusheng Xu, and Mingguang He: School of Optometry, The Hong Kong Polytechnic University; Jiancheng Yang: Swiss Federal Institute of Technology Lausanne (EPFL); Siyu Huang: School of Computing, Clemson University; Kai Jin: Department of Ophthalmology, The Second Affiliated Hospital, School of Medicine, Zhejiang University; Shan Lin: Wuhan Bright Eye Hospital; Jin Wei and Xun Xu: Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases; Mayinuer Yusufu: Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital; Shunming Liu: Department of Ophthalmology, Guangdong Academy of Medical Sciences, Guangdong Provincial People’s Hospital; Qing Zhang: Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University; Zongyuan Ge: AIM for Health Lab, Faculty of Information Technology, Monash University. |
| title | A multimodal visual–language foundation model for computational ophthalmology |
| url | https://doi.org/10.1038/s41746-025-01772-2 |