A performance-driven hybrid text-image classification model for multimodal data

Bibliographic Details
Main Authors: Swati Gupta, Bal Kishan
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-95674-8
Description
Summary: Abstract Hybrid models that combine text and image processing are transforming deep learning approaches to classification tasks involving multimodal data, that is, data that mixes textual and visual content. This work presents the HTIC model, a hybrid text-image classifier built on a deep learning architecture that pairs VGG16 for image classification with RoBERTa (backed by MySQL) for text classification, together with optimized CNNs. RoBERTa is effective at extracting useful information from textual embeddings and at capturing complex patterns across modalities, while multimodal feature extraction layers ensure that the text and image representations remain compatible. As hybrid modeling matures, such models could improve classification accuracy, interpretability, and applicability in the era of multimodal data analysis. The work evaluates the HTIC model against other classification methods on five different datasets and finds that it generally outperforms the alternatives. Because HTIC generalizes well and is robust to variations in the input data, it is well suited to real-world applications such as the NFT dataset used in this work.
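
To make the fusion idea in the abstract concrete, the following is a minimal sketch of a hybrid text-image classifier in the spirit of HTIC, assuming PyTorch, torchvision, and Hugging Face transformers. The concatenation-plus-MLP fusion head, layer sizes, and dropout rate are illustrative assumptions, not the paper's exact architecture; only the VGG16 image branch and RoBERTa text branch come from the abstract.

```python
# Illustrative sketch: fuse a VGG16 image embedding with a RoBERTa text
# embedding and classify the concatenated feature vector. The fusion head
# (Linear -> ReLU -> Dropout -> Linear) is a hypothetical design choice.
import torch
import torch.nn as nn
from torchvision.models import vgg16
from transformers import RobertaModel, RobertaTokenizer

class HybridTextImageClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Image branch: VGG16 convolutional features, pooled to a 512-d vector
        backbone = vgg16(weights=None)  # load pretrained weights in practice
        self.image_encoder = backbone.features
        self.image_pool = nn.AdaptiveAvgPool2d((1, 1))
        # Text branch: RoBERTa; its hidden states are 768-d per token
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # Fusion head: map the concatenated 1280-d feature to class logits
        self.classifier = nn.Sequential(
            nn.Linear(512 + 768, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_pool(self.image_encoder(pixel_values)).flatten(1)
        txt = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]  # first-token embedding as sentence vector
        fused = torch.cat([img, txt], dim=1)  # multimodal feature fusion
        return self.classifier(fused)

# Usage with dummy inputs (downloads roberta-base weights on first run)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = HybridTextImageClassifier(num_classes=5)
batch = tokenizer(["an example caption"], return_tensors="pt", padding=True)
images = torch.randn(1, 3, 224, 224)  # one RGB image, ImageNet-sized
logits = model(images, batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # torch.Size([1, 5])
```

Concatenation is only one possible fusion strategy; attention-based or gated fusion layers would serve the same compatibility role the abstract attributes to the multimodal feature extraction layers.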
ISSN: 2045-2322