Semantic and lexical analysis of pre-trained vision language artificial intelligence models for automated image descriptions in civil engineering
Abstract This paper investigates the application of pre-trained Vision-Language Models (VLMs) for describing images from civil engineering materials and construction sites, with a focus on construction components, structural elements, and materials. The novelty of this study lies in the investigation of VLMs for this specialized domain, which has not been previously addressed. As a case study, the paper evaluates ChatGPT-4v’s ability to serve as a descriptor tool by comparing its performance with three human descriptions (a civil engineer and two engineering interns). The contributions of this work include adapting a pre-trained VLM to civil engineering applications without additional fine-tuning and benchmarking its performance using both semantic similarity analysis (SentenceTransformers) and lexical similarity methods. Utilizing two datasets—one from a publicly available online repository and another manually collected by the authors—the study employs whole-text and sentence pair-wise similarity analyses to assess the model’s alignment with human descriptions. Results demonstrate that the best-performing model achieved an average similarity of 76% (4% standard deviation) when compared to human-generated descriptions. The analysis also reveals better performance on the publicly available dataset.
| Main Authors: | Pedram Bazrafshan, Kris Melag, Arvin Ebrahimkhanlou |
|---|---|
| Affiliation: | Civil, Architectural, and Environmental Engineering, Drexel University |
| Format: | Article |
| Language: | English |
| Published: | Springer Nature, 2025-08-01 |
| Series: | AI in Civil Engineering |
| ISSN: | 2097-0943, 2730-5392 |
| Subjects: | Vision language models; Artificial intelligence; Image description; Pre-trained Transformers; Civil engineering; Digital twin |
| Online Access: | https://doi.org/10.1007/s43503-025-00063-9 |
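The abstract describes scoring model output against human descriptions with whole-text and sentence pair-wise lexical similarity. The paper does not publish its exact scoring code, so the following is only a minimal sketch of one common approach: bag-of-words cosine similarity, with a greedy best-match average for the pair-wise variant. All function names here are illustrative, not from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count word tokens (bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def whole_text_similarity(model_desc: str, human_desc: str) -> float:
    """Compare the two descriptions as single documents."""
    return cosine_similarity(tokenize(model_desc), tokenize(human_desc))


def pairwise_sentence_similarity(model_desc: str, human_desc: str) -> float:
    """For each model sentence, keep its best match among the human
    sentences, then average those best-match scores."""
    split = lambda t: [s for s in re.split(r"(?<=[.!?])\s+", t.strip()) if s]
    model_sents, human_sents = split(model_desc), split(human_desc)
    if not model_sents or not human_sents:
        return 0.0
    best_scores = [
        max(cosine_similarity(tokenize(m), tokenize(h)) for h in human_sents)
        for m in model_sents
    ]
    return sum(best_scores) / len(best_scores)
```

The semantic variant reported in the paper would swap the bag-of-words vectors for sentence embeddings, e.g. encoding each description with a SentenceTransformers model and taking the cosine similarity of the resulting vectors.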