Automation of historical weather data rescue

Bibliographic Details
Main Authors: Y. Zhang, R. E. Sieber
Format: Article
Language: English
Published: Wiley 2025-01-01
Series: Geoscience Data Journal
Online Access: https://doi.org/10.1002/gdj3.261
Description
Summary: Data rescuers worldwide have been trying to retrieve millions of valuable historical weather records so that the observations they contain are preserved, searchable, analysable and machine readable. Most of these records are written by hand, in print or cursive handwriting. Automatic transcription has so far not been reliable or accurate enough on handwritten data, so most historical records are transcribed manually. Recent attempts to integrate artificial intelligence (AI) to transcribe the records automatically have not produced promising results. Currently there is no end‐to‐end workflow to automatically transcribe historical handwritten tabular records into digital datasets. We propose a workflow that uses AI to automate the handwriting transcription process. The workflow is tested using historical climate records from the Data Rescue: Archives and Weather (DRAW) project. It is composed of five steps: (1) image pre‐processing, (2) text line segmentation, (3) bounding box detection, (4) AI‐enabled optical character recognition (OCR) and (5) layout re‐arrangement. These steps are modular to better accommodate future advances (e.g., new image training data, better layout detectors). We hope the proposed workflow can serve as an easily replicable guideline and be used to transcribe other historical datasets.
ISSN: 2049-6060
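
The five modular steps named in the summary could be chained along the following lines. This is only a minimal sketch of the modularity idea, not the authors' implementation: the `Workflow` class, the step names and every step function are hypothetical, and a short pipe-delimited string stands in for a scanned page so the example runs without any image or OCR library.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


class Workflow:
    """Ordered, named steps; any one step can be swapped without touching the rest."""

    def __init__(self) -> None:
        self._steps: List[Tuple[str, Step]] = []

    def add(self, name: str, step: Step) -> "Workflow":
        self._steps.append((name, step))
        return self

    def replace(self, name: str, step: Step) -> None:
        for i, (existing, _) in enumerate(self._steps):
            if existing == name:
                self._steps[i] = (name, step)
                return
        raise KeyError(name)

    def run(self, state: State) -> State:
        for _, step in self._steps:
            state = step(state)
        return state


# Hypothetical placeholder steps. Toy text stands in for a page image.
def preprocess(state: State) -> State:
    # (1) image pre-processing: here, just trim surrounding whitespace
    return {**state, "page": state["page"].strip()}


def segment_lines(state: State) -> State:
    # (2) text line segmentation: one text line per table row
    return {**state, "lines": state["page"].splitlines()}


def detect_boxes(state: State) -> State:
    # (3) bounding box detection: each box is (row, column, raw cell content)
    boxes = [
        (r, c, cell)
        for r, line in enumerate(state["lines"])
        for c, cell in enumerate(line.split("|"))
    ]
    return {**state, "boxes": boxes}


def recognise(state: State) -> State:
    # (4) OCR: a real workflow would call a recognition model on each box
    return {**state, "cells": [(r, c, raw.strip()) for r, c, raw in state["boxes"]]}


def rearrange_layout(state: State) -> State:
    # (5) layout re-arrangement: rebuild rows from (row, column) positions
    rows: Dict[int, List[str]] = {}
    for r, _c, text in sorted(state["cells"]):
        rows.setdefault(r, []).append(text)
    return {**state, "table": [rows[r] for r in sorted(rows)]}


def build_workflow() -> Workflow:
    return (
        Workflow()
        .add("preprocess", preprocess)
        .add("segment_lines", segment_lines)
        .add("detect_boxes", detect_boxes)
        .add("ocr", recognise)
        .add("rearrange_layout", rearrange_layout)
    )


if __name__ == "__main__":
    result = build_workflow().run({"page": "  29.92 | 54\n30.01 | 51  \n"})
    print(result["table"])
```

Because each step only reads and writes keys of a shared state dictionary, a better recogniser or layout detector could be dropped in with `replace("ocr", new_step)` while the other four steps stay unchanged.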