Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists’ Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study

Background: The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an “infodemic” of misinformation, particularly prevalent in women’s health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women’s health. The pandemic heightened the risks that COVID-19 poses to pregnant women, necessitating balanced advice from specialists on vaccine safety versus known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care; however, such tools require rigorous testing.

Objective: This study aimed to assess LLMs’ proficiency, clarity, and objectivity regarding COVID-19’s impacts on pregnancy.

Methods: This study evaluated 4 major AI chatbots (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts on a questionnaire previously validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text mining, sentiment analysis, and readability analysis (Flesch-Kincaid Grade Level and Flesch Reading Ease Score) were also conducted.

Results: In terms of the LLMs’ knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (–4), followed by ChatGPT-4 (–6) and Google Bard (–7), while ChatGPT-3.5 obtained the most negative score (–12). In the readability analysis, the Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible (9.9 and 49), followed by ChatGPT-4 (12.4 and 37.1), while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses.

Conclusions: The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI’s output was tailored to a specific audience, with the ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in how the LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, up-to-date information on COVID-19 and vaccines in maternal and fetal health, in line with health guidelines. The study demonstrates the potential role of AI in supplementing health care knowledge, along with the need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and the level of detail required.
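The integer sentiment scores reported in the Results are consistent with a lexicon-based count of positive minus negative terms, but the abstract does not state which sentiment tool or lexicon the authors used. The sketch below is only a minimal illustration of that general approach; the word lists are hypothetical placeholders, not the study’s actual lexicon.

```python
# Hedged sketch of a lexicon-based polarity score (positive minus negative term count).
# The study's actual sentiment tool and lexicon are not specified in the abstract;
# these mini word lists are hypothetical and for illustration only.
import re

POSITIVE = {"safe", "effective", "protective", "recommended", "reassuring"}
NEGATIVE = {"risk", "severe", "complication", "death", "harm"}

def polarity_score(text: str) -> int:
    """Return (# positive tokens) - (# negative tokens) for a response text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

# A more negative score indicates more risk-laden framing in the response.
print(polarity_score("Vaccination is safe and effective, though severe illness remains a risk."))
```

Under this kind of scheme, a response such as ChatGPT-3.5’s (–12) would simply contain many more negatively connoted terms than positively connoted ones, which is what the study interprets as less reassuring framing.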
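The two readability metrics named in the Methods are standard closed-form formulas over word, sentence, and syllable counts. The sketch below shows those formulas directly; the tokenization and syllable-counting rules (and the example counts) are assumptions, since the abstract does not specify which implementation the authors used.

```python
# Minimal sketch of the two readability metrics named in the Methods.
# Word, sentence, and syllable counts must come from a tokenizer of your choice;
# the study's exact counting rules are not stated in the abstract.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores (0-100) indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

if __name__ == "__main__":
    # Hypothetical counts for a single chatbot response, for illustration only.
    w, s, syl = 420, 18, 700
    print(round(flesch_reading_ease(w, s, syl), 1))
    print(round(flesch_kincaid_grade(w, s, syl), 1))
```

On these scales, Microsoft Copilot’s scores (grade level 9.9, reading ease 49) correspond to noticeably simpler prose than the roughly college-level output of the other three models.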
Bibliographic Details
Main Authors: Nicola Luigi Bragazzi, Michèle Buchinger, Hisham Atwan, Ruba Tuma, Francesco Chirico, Lukasz Szarpak, Raymond Farah, Rola Khamisy-Farah
Format: Article
Language: English
Published: JMIR Publications, 2025-02-01
Series: JMIR Formative Research
ISSN: 2561-326X
DOI: 10.2196/56126
Online Access: https://formative.jmir.org/2025/1/e56126