Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes

Bibliographic Details
Main Authors:	Bjørn-Richard Pedersen, Einar Holsbø, Trygve Andersen, Nikita Shvetsov, Johan Ravn, Hilde Leikny Sommerseth, Lars Ailo Bongo
Format:	Article
Language:	English
Published:	International Institute of Social History 2022-01-01
Series:	Historical Life Course Studies
Subjects:	Machine learning Historical data Pipeline Census 1950 Norway
Online Access:	https://hlcs.nl/article/view/11331

author	Bjørn-Richard Pedersen, Einar Holsbø, Trygve Andersen, Nikita Shvetsov, Johan Ravn, Hilde Leikny Sommerseth, Lars Ailo Bongo
collection	DOAJ
description	Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at https://github.com/uit-hdl/rhd-codes.
format	Article
id	doaj-art-7b9631dad0dc480f99c04cbe265b4f3a
institution	Kabale University
issn	2352-6343
language	English
publishDate	2022-01-01
publisher	International Institute of Social History
record_format	Article
series	Historical Life Course Studies
spelling	doaj-art-7b9631dad0dc480f99c04cbe265b4f3a | 2025-02-02T16:24:34Z | eng | International Institute of Social History | Historical Life Course Studies | 2352-6343 | 2022-01-01 | 12 | 10.51964/hlcs11331 | Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes | Bjørn-Richard Pedersen (Norwegian Historical Data Centre, UiT The Arctic University of Norway); Einar Holsbø (Department of Computer Science, UiT The Arctic University of Norway); Trygve Andersen (Norwegian Historical Data Centre, UiT The Arctic University of Norway); Nikita Shvetsov (Department of Computer Science, UiT The Arctic University of Norway); Johan Ravn (Medsensio AS, Tromsø, Norway); Hilde Leikny Sommerseth (Norwegian Historical Data Centre, UiT The Arctic University of Norway); Lars Ailo Bongo (Department of Computer Science, UiT The Arctic University of Norway) | https://hlcs.nl/article/view/11331 | Machine learning Historical data Pipeline Census 1950 Norway
title	Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes
topic	Machine learning Historical data Pipeline Census 1950 Norway
url	https://hlcs.nl/article/view/11331

Similar Items