Ensemble automated approaches for producing high‐quality herbarium digital records

Abstract. Premise: One of the slowest steps in digitizing natural history collections is converting labels associated with specimens into a digital data record usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via...

Bibliographic Details
Main Authors: Robert P. Guralnick, Raphael LaFrance, Julie M. Allen, Michael W. Denslow
Format: Article
Language: English
Published: Wiley 2025-01-01
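Series: Applications in Plant Sciences
Subjects: ChatGPT, digitization, ensemble methods, information extraction, large language models, machine learning
Online Access: https://doi.org/10.1002/aps3.11623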
_version_ 1832542913446281216
author Robert P. Guralnick
Raphael LaFrance
Julie M. Allen
Michael W. Denslow
collection DOAJ
description Abstract. Premise: One of the slowest steps in digitizing natural history collections is converting labels associated with specimens into a digital data record usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via extraction into standardized Darwin Core fields. Methods: We first showcase the development of a rule‐based approach and compare outcomes with a large language model–based approach, in particular ChatGPT4. We next quantified omission and commission error rates across target fields for a set of labels transcribed using optical character recognition (OCR) for both approaches. For example, we find that ChatGPT4 often creates field names that are not Darwin Core compliant while rule‐based approaches often have high commission error rates. Results: Our results suggest that these approaches each have different strengths and limitations. We therefore developed an ensemble approach that leverages the strengths of each individual method and documented that ensembling strongly reduced overall information extraction errors. Discussion: This work shows that an ensemble approach has particular value for creating high‐quality digital data records, even for complicated label content. While human validation is still needed to ensure the best possible quality, automated approaches can speed digitization of herbarium specimen labels and are likely to be broadly usable for all natural history collection types.
format Article
id doaj-art-b44163dd367e4e2d86d99e70a1411f8c
institution Kabale University
issn 2168-0450
language English
publishDate 2025-01-01
publisher Wiley
record_format Article
series Applications in Plant Sciences
spelling doaj-art-b44163dd367e4e2d86d99e70a1411f8c; 2025-02-03T12:21:34Z; eng; Wiley; Applications in Plant Sciences; ISSN 2168-0450; 2025-01-01; vol. 13, no. 1; pages n/a; DOI 10.1002/aps3.11623
Ensemble automated approaches for producing high‐quality herbarium digital records
Robert P. Guralnick (Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA)
Raphael LaFrance (Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA)
Julie M. Allen (Department of Biological Sciences, Virginia Tech, Blacksburg, Virginia, USA)
Michael W. Denslow (Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA)
title Ensemble automated approaches for producing high‐quality herbarium digital records
topic ChatGPT
digitization
ensemble methods
information extraction
large language models
machine learning
url https://doi.org/10.1002/aps3.11623