Beyond Text Generation: Assessing Large Language Models’ Ability to Reason Logically and Follow Strict Rules

Bibliographic Details
Main Authors: Zhiyong Han, Fortunato Battaglia, Kush Mansuria, Yoav Heyman, Stanley R. Terlecky
Format: Article
Language:English
Published: MDPI AG 2025-01-01
Series:AI
Subjects: large language models; reasoning; rule-following; HIPAA privacy rule
Online Access:https://www.mdpi.com/2673-2688/6/1/12
collection DOAJ
description The growing interest in advanced large language models (LLMs) like ChatGPT has sparked debate about how best to use them in various human activities. However, a neglected issue in this debate is whether LLMs can reason logically and follow rules in novel contexts, capabilities that are critical to our understanding and application of these models. To address this knowledge gap, this study investigates five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) using word ladder puzzles to assess their logical reasoning and rule-adherence capabilities. Our two-phase methodology involves (1) giving explicit instructions about word ladder puzzles and the rules for solving them, then evaluating each model’s understanding of those rules, followed by (2) assessing the LLMs’ ability to create and solve word ladder puzzles while adhering to the rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations as an example of a real-world scenario. Our findings reveal that LLMs show a persistent lack of logical reasoning and systematically fail to follow puzzle rules. Furthermore, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. These findings expose critical flaws in LLMs’ reasoning and rule-following capabilities, raising concerns about their reliability in tasks requiring strict rule-following and logical reasoning. We therefore urge caution when integrating LLMs into critical fields and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
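For readers unfamiliar with the puzzle format the study uses: a word ladder transforms a start word into a target word one step at a time, where each step must change exactly one letter and produce a real word of the same length. The sketch below is purely illustrative (it is not the authors' evaluation code, and the small word list is a stand-in for a real dictionary); it shows the kind of rule check the puzzles impose.

```python
# Illustrative sketch (not the paper's evaluation code): a checker for the
# word ladder rules that each step changes exactly one letter, keeps word
# length constant, and produces a word found in the dictionary.

def one_letter_apart(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1

def valid_ladder(ladder: list[str], dictionary: set[str]) -> bool:
    """Check that every word is in the dictionary and every consecutive
    pair of words obeys the one-letter-change rule."""
    if len(ladder) < 2:
        return False
    if any(word not in dictionary for word in ladder):
        return False
    return all(one_letter_apart(a, b) for a, b in zip(ladder, ladder[1:]))

# A toy dictionary standing in for a full word list.
words = {"cold", "cord", "card", "ward", "warm"}
print(valid_ladder(["cold", "cord", "card", "ward", "warm"], words))  # True
print(valid_ladder(["cold", "warm"], words))  # False: changes four letters at once
```

A checker like this makes the study's failure mode concrete: an LLM that outputs a "ladder" with a step changing two letters, or an invented non-word, violates a rule that is mechanically verifiable.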
id doaj-art-76e269521f8742d8853f9a6a607d3411
institution Kabale University
issn 2673-2688
doi 10.3390/ai6010012
affiliation Department of Medical Sciences, Hackensack Meridian School of Medicine, Nutley, NJ 07110, USA (all five authors)
topic large language models
reasoning
rule-following
HIPAA privacy rule
url https://www.mdpi.com/2673-2688/6/1/12