Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/9/4766 |
| Summary: | This paper extends our previous work on cleft lip and/or palate (CL/P) classification, which employed vision transformers (ViTs) and Siamese neural networks. We now integrate SigLIP 2, a state-of-the-art multilingual vision–language model, for feature extraction, replacing the previously utilized BiomedCLIP. SigLIP 2 offers enhanced semantic understanding, improved localization capabilities, and multilingual support, potentially leading to more robust feature representations for CL/P classification. We hypothesize that SigLIP 2’s superior feature extraction will improve the classification accuracy of CL/P types (bilateral, unilateral, and palate-only) from the UltraSuite CLEFT dataset, a collection of ultrasound video sequences capturing tongue movements during speech with synchronized audio recordings. A comparative analysis is conducted, evaluating the performance of our original ViT-Siamese network model (using BiomedCLIP) against a new model leveraging SigLIP 2 for feature extraction. Performance is assessed using accuracy, precision, recall, and F1 score, demonstrating the impact of SigLIP 2 on CL/P classification. The new model achieves statistically significant improvements in overall accuracy (86.6% vs. 82.76%) and F1 scores for all cleft types. We discuss the computational efficiency and practical implications of employing SigLIP 2 in a clinical setting, highlighting its potential for earlier and more accurate diagnosis, personalized treatment planning, and broader applicability across diverse populations. The results demonstrate the significant potential of advanced vision–language models, such as SigLIP 2, to enhance AI-powered medical diagnostics. |
| ISSN: | 2076-3417 |
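
For readers who want to experiment with the feature-extraction step described in the summary, the following is a minimal sketch of pulling SigLIP 2 image embeddings through the Hugging Face transformers library (version 4.49 or later). This is illustrative only, not the authors' pipeline: the checkpoint name and the frame path are assumptions, and in the paper the embeddings feed a Siamese comparison network rather than a direct classifier.

```python
# Minimal sketch: extracting SigLIP 2 image embeddings from a single
# ultrasound frame. The checkpoint name below is an assumption; any
# SigLIP 2 checkpoint on the Hugging Face Hub should load the same way.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip2-base-patch16-224"  # assumed checkpoint name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def extract_features(image: Image.Image) -> torch.Tensor:
    """Return one L2-normalized SigLIP 2 embedding for a single frame."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    # Normalizing puts embeddings on the unit sphere, which suits the
    # distance-based comparisons a Siamese network performs.
    return torch.nn.functional.normalize(feats, dim=-1)

# Example usage (the frame path is hypothetical):
# emb = extract_features(Image.open("frame_0001.png").convert("RGB"))
```

Embeddings extracted this way for pairs of frames could then be compared with a contrastive or triplet objective, mirroring the ViT-Siamese design the summary describes.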