Beyond Text Generation: Assessing Large Language Models’ Ability to Reason Logically and Follow Strict Rules

Bibliographic Details
Main Authors: Zhiyong Han, Fortunato Battaglia, Kush Mansuria, Yoav Heyman, Stanley R. Terlecky
Format: Article
Language:English
Published: MDPI AG 2025-01-01
Series:AI
Subjects: large language models; reasoning; rule-following; HIPAA privacy rule
Online Access:https://www.mdpi.com/2673-2688/6/1/12
collection DOAJ
description The growing interest in advanced large language models (LLMs) like ChatGPT has sparked debate about how best to use them in various human activities. However, a neglected issue in this debate is whether LLMs can reason logically and follow rules in novel contexts, capabilities that are critical to our understanding and application of these models. To address this knowledge gap, this study investigates five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) using word ladder puzzles to assess their logical reasoning and rule-adherence capabilities. Our two-phase methodology involves (1) giving explicit instructions about word ladder puzzles and the rules for solving them, then evaluating each model’s understanding of those rules, followed by (2) assessing the LLMs’ ability to create and solve word ladder puzzles while adhering to the rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations as an example of a real-world scenario. Our findings reveal that LLMs show a persistent lack of logical reasoning and systematically fail to follow puzzle rules. Furthermore, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. These findings expose critical flaws in LLMs’ reasoning and rule-following capabilities, raising concerns about their reliability in tasks requiring strict rule-following and logical reasoning. We therefore urge caution when integrating LLMs into critical fields and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
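For readers unfamiliar with the puzzle format the study uses: a word ladder transforms a start word into a target word one step at a time, where each step must change exactly one letter and produce a real word of the same length. The sketch below is purely illustrative (it is not the authors' evaluation code, and the small word list is a stand-in for a real dictionary); it shows the kind of rule check the puzzles impose.

```python
# Illustrative sketch (not the paper's evaluation code): a checker for the
# word ladder rules that each step changes exactly one letter, keeps word
# length constant, and produces a word found in the dictionary.

def one_letter_apart(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1

def valid_ladder(ladder: list[str], dictionary: set[str]) -> bool:
    """Check that every word is in the dictionary and every consecutive
    pair of words obeys the one-letter-change rule."""
    if len(ladder) < 2:
        return False
    if any(word not in dictionary for word in ladder):
        return False
    return all(one_letter_apart(a, b) for a, b in zip(ladder, ladder[1:]))

# A toy dictionary standing in for a full word list.
words = {"cold", "cord", "card", "ward", "warm"}
print(valid_ladder(["cold", "cord", "card", "ward", "warm"], words))  # True
print(valid_ladder(["cold", "warm"], words))  # False: changes four letters at once
```

A checker like this makes the study's failure mode concrete: an LLM that outputs a "ladder" with a step changing two letters, or an invented non-word, violates a rule that is mechanically verifiable.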
id doaj-art-76e269521f8742d8853f9a6a607d3411
institution Kabale University
issn 2673-2688
doi 10.3390/ai6010012
affiliation Department of Medical Sciences, Hackensack Meridian School of Medicine, Nutley, NJ 07110, USA (all five authors)
topic large language models
reasoning
rule-following
HIPAA privacy rule
url https://www.mdpi.com/2673-2688/6/1/12