Semantic and lexical analysis of pre-trained vision language artificial intelligence models for automated image descriptions in civil engineering

Abstract This paper investigates the application of pre-trained Vision-Language Models (VLMs) for describing images of civil engineering materials and construction sites, with a focus on construction components, structural elements, and materials. The novelty of this study lies in the investigation of VLMs for this specialized domain, which has not been previously addressed. As a case study, the paper evaluates ChatGPT-4v's ability to serve as a descriptor tool by comparing its performance with descriptions written by three humans (a civil engineer and two engineering interns). The contributions of this work include adapting a pre-trained VLM to civil engineering applications without additional fine-tuning and benchmarking its performance using both semantic similarity analysis (SentenceTransformers) and lexical similarity methods. Utilizing two datasets (one from a publicly available online repository and another manually collected by the authors), the study employs whole-text and sentence pair-wise similarity analyses to assess the model's alignment with human descriptions. Results demonstrate that the best-performing model achieved an average similarity of 76% (4% standard deviation) when compared to human-generated descriptions. The analysis also reveals better performance on the publicly available dataset.
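As a rough illustration of the comparison the abstract describes, the sketch below scores a hypothetical VLM-generated image description against a human-written one using a SentenceTransformers embedding (semantic similarity) and a simple word-overlap ratio (lexical similarity). The checkpoint name, example sentences, and Jaccard overlap metric are assumptions for illustration only; the paper's actual models, prompts, and lexical measure may differ.

```python
# Minimal sketch (not the authors' pipeline): compare a VLM-generated
# description with a human-written one, semantically and lexically.
from sentence_transformers import SentenceTransformer, util

# Checkpoint name is an assumption; the paper does not state which
# SentenceTransformers model was used.
model = SentenceTransformer("all-MiniLM-L6-v2")

vlm_text = "A reinforced concrete column with exposed rebar at a construction site."
human_text = "Concrete column under construction showing exposed steel reinforcement bars."

# Semantic similarity: cosine similarity between whole-text embeddings.
emb = model.encode([vlm_text, human_text], convert_to_tensor=True)
semantic_score = util.cos_sim(emb[0], emb[1]).item()

# Lexical similarity: Jaccard overlap of lower-cased word sets,
# a stand-in for whatever lexical metric the paper actually applies.
a, b = set(vlm_text.lower().split()), set(human_text.lower().split())
lexical_score = len(a & b) / len(a | b)

print(f"semantic: {semantic_score:.2f}, lexical: {lexical_score:.2f}")
```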

Bibliographic Details
Main Authors: Pedram Bazrafshan, Kris Melag, Arvin Ebrahimkhanlou (Civil, Architectural, and Environmental Engineering, Drexel University)
Format: Article
Language: English
Published: Springer Nature, 2025-08-01
Series: AI in Civil Engineering
Subjects: Vision language models; Artificial intelligence; Image description; Pre-Trained Transformers; Civil engineering; Digital twin
Online Access: https://doi.org/10.1007/s43503-025-00063-9
DOAJ record ID: doaj-art-0b51a95db61f4d0685c98e635779b664
ISSN: 2097-0943, 2730-5392