Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design.
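The "image-first" vs. "text-first" manipulation the abstract describes can be illustrated with a short sketch. This uses the OpenAI Chat Completions message format purely as an example of ordering modality parts within a single prompt; the function name, question, and image URL are hypothetical, and this is a simplified illustration of the study's variable, not the authors' experimental code.

```python
# Sketch: building "image-first" vs. "text-first" multi-modal prompts.
# The message structure follows the OpenAI Chat Completions API; the
# helper and its inputs are hypothetical illustrations.

def build_prompt(question: str, image_url: str, image_first: bool) -> list[dict]:
    """Return a chat `messages` list with the image part placed either
    before or after the textual question."""
    text_part = {"type": "text", "text": question}
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    parts = [image_part, text_part] if image_first else [text_part, image_part]
    return [{"role": "user", "content": parts}]

# For a simple single-image task, the only difference between the two
# conditions is the order of the content parts:
msgs = build_prompt(
    "How many distinct objects are visible?",
    "https://example.com/scene.png",  # hypothetical image URL
    image_first=True,
)
print(msgs[0]["content"][0]["type"])  # -> image_url
```

For multi-step tasks, the abstract suggests ordering parts to mirror the logical structure of the inference chain rather than fixing one modality first.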
Saved in:

| Main Authors: | Grant Wardle, Teo Sušnjak |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Big Data and Cognitive Computing |
| Subjects: | multi-modal prompting; interactive AI systems; user-guided AI adaptation; multi-modal large language models; modality fusion; multi-modal reasoning |
| Online Access: | https://www.mdpi.com/2504-2289/9/6/149 |
| _version_ | 1849418153202810880 |
|---|---|
| author | Grant Wardle Teo Sušnjak |
| author_facet | Grant Wardle Teo Sušnjak |
| author_sort | Grant Wardle |
| collection | DOAJ |
| description | Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts. |
| format | Article |
| id | doaj-art-e67cf9308d1b42e28d008911bd5d0456 |
| institution | Kabale University |
| issn | 2504-2289 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Big Data and Cognitive Computing |
| spelling | doaj-art-e67cf9308d1b42e28d008911bd5d0456; 2025-08-20T03:32:31Z; eng; MDPI AG; Big Data and Cognitive Computing; ISSN 2504-2289; 2025-06-01; vol. 9, iss. 6, art. 149; doi:10.3390/bdcc9060149; Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks; Grant Wardle, Teo Sušnjak (School of Mathematical and Computational Sciences, Massey University, Auckland 0632, New Zealand); https://www.mdpi.com/2504-2289/9/6/149; multi-modal prompting; interactive AI systems; user-guided AI adaptation; multi-modal large language models; modality fusion; multi-modal reasoning |
| spellingShingle | Grant Wardle Teo Sušnjak Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks Big Data and Cognitive Computing multi-modal prompting interactive AI systems user-guided AI adaptation multi-modal large language models modality fusion multi-modal reasoning |
| title | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks |
| title_full | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks |
| title_fullStr | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks |
| title_full_unstemmed | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks |
| title_short | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks |
| title_sort | image first or text first optimising the sequencing of modalities in large language model prompting and reasoning tasks |
| topic | multi-modal prompting; interactive AI systems; user-guided AI adaptation; multi-modal large language models; modality fusion; multi-modal reasoning |
| url | https://www.mdpi.com/2504-2289/9/6/149 |
| work_keys_str_mv | AT grantwardle imagefirstortextfirstoptimisingthesequencingofmodalitiesinlargelanguagemodelpromptingandreasoningtasks AT teosusnjak imagefirstortextfirstoptimisingthesequencingofmodalitiesinlargelanguagemodelpromptingandreasoningtasks |