Better Skeleton Better Readability: Scene Text Image Super-Resolution via Skeleton-Aware Diffusion Model

Bibliographic Details
Main Authors: Shrey Singh, Prateek Keserwani, Partha Pratim Roy, Rajkumar Saini
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10772209/
Description
Summary: Scene text image super-resolution (STISR) aims to increase the resolution of text images while improving their readability by reducing noise, blur, and other degradations. Existing diffusion-based approaches to STISR rely primarily on text-prior information but often overlook the importance of explicitly modeling the visual structure of the text. In this paper, we propose a novel Skeleton-Aware Diffusion Method (SADM) for STISR, which introduces text skeletons as structural guidance for the diffusion process. The text skeleton serves as a critical visual cue that helps the model restore the fine details of text, even in severely degraded low-resolution images. Generating high-quality skeletons from low-resolution scene text is challenging because of the blur and noise inherent in such images. To tackle this, we introduce a diffusion-based Skeleton Correction Network (SCN), which refines the initial skeletons produced by a convolutional neural network-based skeletonization model. The SCN improves the accuracy of the skeletons, allowing more precise structural guidance during the diffusion process. Our extensive experiments demonstrate the benefits of incorporating skeleton information into the STISR pipeline. The proposed SADM achieves state-of-the-art performance on the TextZoom dataset, with recognition accuracies of 81.4%, 64.9%, and 49.6% on the easy, medium, and hard subsets, respectively, when evaluated with the ASTER text recognizer, improving on the previous best results. Through detailed analysis, we also show that improving the quality of skeletons extracted from low-resolution images leads to better super-resolution outcomes and enhances the performance of text recognizers.
ISSN:2169-3536