Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks

Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, the positioning of the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle to integrate prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems.


Bibliographic Details
Main Authors: Grant Wardle, Teo Sušnjak
Format: Article
Language: English
Published: MDPI AG, 2025-06-01
Series: Big Data and Cognitive Computing
Subjects: multi-modal prompting; interactive AI systems; user-guided AI adaptation; multi-modal large language models; modality fusion; multi-modal reasoning
Online Access: https://www.mdpi.com/2504-2289/9/6/149
_version_ 1849418153202810880
author Grant Wardle
Teo Sušnjak
author_facet Grant Wardle
Teo Sušnjak
author_sort Grant Wardle
collection DOAJ
description Our study investigates how the sequencing of text and image inputs within multi-modal prompts affects the reasoning performance of Large Language Models (LLMs). Through empirical evaluations of three major commercial LLM vendors—OpenAI, Google, and Anthropic—alongside a user study on interaction strategies, we develop and validate practical heuristics for optimising multi-modal prompt design. Our findings reveal that modality sequencing is a critical factor influencing reasoning performance, particularly in tasks with varying cognitive load and structural complexity. For simpler tasks involving a single image, positioning the modalities directly impacts model accuracy, whereas in complex, multi-step reasoning scenarios, the sequence must align with the logical structure of inference, often outweighing the specific placement of individual modalities. Furthermore, we identify systematic challenges in multi-hop reasoning within transformer-based architectures, where models demonstrate strong early-stage inference but struggle with integrating prior contextual information in later reasoning steps. Building on these insights, we propose a set of validated, user-centred heuristics for designing effective multi-modal prompts, enhancing both reasoning accuracy and user interaction with AI systems. Our contributions inform the design and usability of interactive intelligent systems, with implications for applications in education, medical imaging, legal document analysis, and customer support. By bridging the gap between intelligent system behaviour and user interaction strategies, this study provides actionable guidance on how users can effectively structure prompts to optimise multi-modal LLM reasoning within real-world, high-stakes decision-making contexts.
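The modality-sequencing idea in the abstract can be illustrated with a minimal sketch. It builds an "image-first" versus "text-first" user message in the style of the OpenAI Chat Completions multi-modal content-part format; the helper function, its parameters, and the example URL are illustrative assumptions, not taken from the paper itself.

```python
# Sketch: constructing "image-first" vs "text-first" multi-modal prompts.
# The message shape follows the OpenAI Chat Completions multi-modal format
# (a list of typed content parts); the helper itself is hypothetical.

def build_prompt(question: str, image_url: str, image_first: bool = True) -> list[dict]:
    """Return a single-user-message payload with modality parts in the chosen order."""
    text_part = {"type": "text", "text": question}
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    # The only variable under study here is the ordering of the two parts.
    parts = [image_part, text_part] if image_first else [text_part, image_part]
    return [{"role": "user", "content": parts}]

if __name__ == "__main__":
    msgs = build_prompt("How many objects are red?", "https://example.com/scene.png")
    print([p["type"] for p in msgs[0]["content"]])  # ['image_url', 'text']
```

For a single-image task of the kind the abstract describes, the two orderings would then be sent as otherwise-identical requests, so any accuracy difference can be attributed to modality sequence alone.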
format Article
id doaj-art-e67cf9308d1b42e28d008911bd5d0456
institution Kabale University
issn 2504-2289
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Big Data and Cognitive Computing
spelling doaj-art-e67cf9308d1b42e28d008911bd5d0456 | indexed 2025-08-20T03:32:31Z | eng | MDPI AG | Big Data and Cognitive Computing | ISSN 2504-2289 | 2025-06-01 | Vol. 9, Iss. 6, Art. 149 | DOI 10.3390/bdcc9060149 | Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks | Grant Wardle; Teo Sušnjak (both: School of Mathematical and Computational Sciences, Massey University, Auckland 0632, New Zealand) | https://www.mdpi.com/2504-2289/9/6/149
spellingShingle Grant Wardle
Teo Sušnjak
Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
Big Data and Cognitive Computing
multi-modal prompting
interactive AI systems
user-guided AI adaptation
multi-modal large language models
modality fusion
multi-modal reasoning
title Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
title_full Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
title_fullStr Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
title_full_unstemmed Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
title_short Image First or Text First? Optimising the Sequencing of Modalities in Large Language Model Prompting and Reasoning Tasks
title_sort image first or text first optimising the sequencing of modalities in large language model prompting and reasoning tasks
topic multi-modal prompting
interactive AI systems
user-guided AI adaptation
multi-modal large language models
modality fusion
multi-modal reasoning
url https://www.mdpi.com/2504-2289/9/6/149
work_keys_str_mv AT grantwardle imagefirstortextfirstoptimisingthesequencingofmodalitiesinlargelanguagemodelpromptingandreasoningtasks
AT teosusnjak imagefirstortextfirstoptimisingthesequencingofmodalitiesinlargelanguagemodelpromptingandreasoningtasks