Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals
Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini 2.0 Flash Experimental, Claude 3.5 Sonnet, ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model’s diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models such as ChatGPT-4 achieved higher diagnostic accuracy than human professionals. In more complex cases, however, such as early schizophrenia, LLM performance varied: ChatGPT-4 achieved only 55% accuracy, while other LLMs and the professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. In outcome predictions, professionals were generally more optimistic about full recovery, especially with treatment, whereas LLMs predicted lower full recovery rates and higher partial recovery rates, particularly in untreated cases. Although LLMs recommended a broader range of treatments, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.
Main Author: | Inbar Levkovich
---|---
Format: | Article
Language: | English
Published: | MDPI AG, 2025-01-01
Series: | European Journal of Investigation in Health, Psychology and Education
Subjects: | large language models; artificial intelligence; mental health; depression; suicide; schizophrenia
Online Access: | https://www.mdpi.com/2254-9625/15/1/9
_version_ | 1832588667435089920
---|---
author | Inbar Levkovich |
author_facet | Inbar Levkovich |
author_sort | Inbar Levkovich |
collection | DOAJ |
description | Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini 2.0 Flash Experimental, Claude 3.5 Sonnet, ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model’s diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models such as ChatGPT-4 achieved higher diagnostic accuracy than human professionals. In more complex cases, however, such as early schizophrenia, LLM performance varied: ChatGPT-4 achieved only 55% accuracy, while other LLMs and the professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. In outcome predictions, professionals were generally more optimistic about full recovery, especially with treatment, whereas LLMs predicted lower full recovery rates and higher partial recovery rates, particularly in untreated cases. Although LLMs recommended a broader range of treatments, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.
format | Article |
id | doaj-art-fce6e22721a0475b8b1fd305e76f2475 |
institution | Kabale University |
issn | 2174-8144 2254-9625 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | European Journal of Investigation in Health, Psychology and Education |
spelling | doaj-art-fce6e22721a0475b8b1fd305e76f24752025-01-24T13:30:40ZengMDPI AGEuropean Journal of Investigation in Health, Psychology and Education2174-81442254-96252025-01-01151910.3390/ejihpe15010009Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health ProfessionalsInbar Levkovich0Faculty of Education, Tel-Hai Academic College, Upper Galilee 2208, IsraelLarge language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.5 Sonnet), ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model’s diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models like ChatGPT-4 achieved higher diagnostic accuracy compared to human professionals. However, in more complex cases, such as early schizophrenia, LLM performance varied, with ChatGPT-4 achieving only 55% accuracy, while other LLMs and professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. In terms of outcome predictions, professionals were generally more optimistic regarding full recovery, especially with treatment, while LLMs predicted lower full recovery rates and higher partial recovery rates, particularly in untreated cases. While LLMs recommend a broader treatment range, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.https://www.mdpi.com/2254-9625/15/1/9large language modelsartificial intelligencemental healthdepressionsuicideschizophrenia |
spellingShingle | Inbar Levkovich Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals European Journal of Investigation in Health, Psychology and Education large language models artificial intelligence mental health depression suicide schizophrenia |
title | Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals |
title_full | Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals |
title_fullStr | Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals |
title_full_unstemmed | Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals |
title_short | Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals |
title_sort | evaluating diagnostic accuracy and treatment efficacy in mental health a comparative analysis of large language model tools and mental health professionals |
topic | large language models; artificial intelligence; mental health; depression; suicide; schizophrenia
url | https://www.mdpi.com/2254-9625/15/1/9 |
work_keys_str_mv | AT inbarlevkovich evaluatingdiagnosticaccuracyandtreatmentefficacyinmentalhealthacomparativeanalysisoflargelanguagemodeltoolsandmentalhealthprofessionals |