Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals

Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini 2.0 Flash Experimental, Claude 3.5 Sonnet, ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model's diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. For certain conditions, including depression and PTSD, models such as ChatGPT-4 achieved higher diagnostic accuracy than human professionals. In more complex cases, such as early schizophrenia, however, performance varied: ChatGPT-4 achieved only 55% accuracy, while the other LLMs and the professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. Regarding outcome predictions, professionals were generally more optimistic about full recovery, especially with treatment, while LLMs predicted lower full-recovery rates and higher partial-recovery rates, particularly in untreated cases. Although LLMs recommend a broader range of treatments, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.


Bibliographic Details
Main Author: Inbar Levkovich
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: European Journal of Investigation in Health, Psychology and Education
Subjects: large language models; artificial intelligence; mental health; depression; suicide; schizophrenia
Online Access: https://www.mdpi.com/2254-9625/15/1/9
ISSN: 2174-8144, 2254-9625
DOI: 10.3390/ejihpe15010009
Author Affiliation: Faculty of Education, Tel-Hai Academic College, Upper Galilee, Israel