Automating Historical Source Transcription

Bibliographic Details
Main Author: Gunnar Thorvaldsen
Format: Article
Language: English
Published: International Institute of Social History 2021-03-01
Series: Historical Life Course Studies
Subjects: Census, Transcription, Machine learning, Population register
Online Access: https://openjournals.nl/index.php/hlcs/article/view/9568
author Gunnar Thorvaldsen
collection DOAJ
description Transcribing the 1950 Norwegian census with 3.3 million person records and linking it to the Central Population Register (CPR) provides longitudinal information about significant population groups during the understudied period of the mid-20th century. Since this source is closed to the public, we receive no help from genealogists and instead use machine learning techniques to semi-automate the transcription. First, the scanned manuscripts are split into individual cells and multiple names are divided. Once the birthdates have been transcribed manually in India, a lookup routine searches for families with matching sets of birthdates in the 1960 census and the CPR. After manual checks with GUI routines, the names are copied to the text version of the 1950 census, and the links to the CPR are stored as well. Other fields, such as occupation or gender, contain numeric or letter codes and are transcribed wholesale with routines that interpret the layout of the graphical images. Work employing these methods has also started on the 1930 census, which is the last of the Norwegian censuses to be transcribed.
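The lookup step in the description, which searches for families with matching sets of birthdates, can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names (`build_index`, `match_household`), the household identifiers, the sample dates, and the thresholds `min_shared` and `min_ratio` are all hypothetical and chosen only for illustration. The idea is to index the reference source (for example the CPR) by birthdate and then rank reference households by how many birthdates they share with a 1950 household.

```python
from collections import defaultdict
from datetime import date


def build_index(households):
    """Map each birthdate to the ids of the reference households containing it."""
    index = defaultdict(set)
    for household_id, birthdates in households.items():
        for birthdate in birthdates:
            index[birthdate].add(household_id)
    return index


def match_household(birthdates, index, min_shared=2, min_ratio=0.6):
    """Rank reference households by the number of birthdates they share.

    min_shared and min_ratio are illustrative thresholds (assumptions),
    not values taken from the article.
    """
    query = set(birthdates)
    shared_counts = defaultdict(int)
    for birthdate in query:
        for household_id in index.get(birthdate, ()):
            shared_counts[household_id] += 1

    matches = []
    for household_id, shared in shared_counts.items():
        # Require both an absolute and a relative overlap so that a single
        # coincidental birthdate does not produce a candidate link.
        if shared >= min_shared and shared / len(query) >= min_ratio:
            matches.append((household_id, shared))
    # Best overlap first; ties broken by id for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return matches


# Hypothetical sample data: one 1950 household and three register households.
household_1950 = [date(1910, 3, 4), date(1912, 7, 19),
                  date(1938, 1, 2), date(1941, 11, 30)]
register = {
    "cpr-A": [date(1910, 3, 4), date(1912, 7, 19), date(1938, 1, 2),
              date(1941, 11, 30), date(1952, 5, 5)],
    "cpr-B": [date(1910, 3, 4), date(1930, 6, 6)],
    "cpr-C": [date(1899, 1, 1)],
}

index = build_index(register)
candidates = match_household(household_1950, index)
print(candidates)
```

In this sketch only `cpr-A` is returned as a candidate, because `cpr-B` shares a single birthdate with the 1950 household. Candidates produced this way would still go to the manual checks that the description mentions before any names or links are copied.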
format Article
id doaj-art-303e86f684524daeb77728a38e690a02
institution Kabale University
issn 2352-6343
language English
publishDate 2021-03-01
publisher International Institute of Social History
series Historical Life Course Studies
spelling doaj-art-303e86f684524daeb77728a38e690a02 | 2025-02-02T13:50:58Z | eng | International Institute of Social History | Historical Life Course Studies | 2352-6343 | 2021-03-01 | 10 | 10.51964/hlcs9568 | Automating Historical Source Transcription | Gunnar Thorvaldsen | Transcribing the 1950 Norwegian census with 3.3 million person records and linking it to the Central Population Register (CPR) provides longitudinal information about significant population groups during the understudied period of the mid-20th century. Since this source is closed to the public, we receive no help from genealogists and instead use machine learning techniques to semi-automate the transcription. First, the scanned manuscripts are split into individual cells and multiple names are divided. Once the birthdates have been transcribed manually in India, a lookup routine searches for families with matching sets of birthdates in the 1960 census and the CPR. After manual checks with GUI routines, the names are copied to the text version of the 1950 census, and the links to the CPR are stored as well. Other fields, such as occupation or gender, contain numeric or letter codes and are transcribed wholesale with routines that interpret the layout of the graphical images. Work employing these methods has also started on the 1930 census, which is the last of the Norwegian censuses to be transcribed. | https://openjournals.nl/index.php/hlcs/article/view/9568 | Census | Transcription | Machine learning | Population register
title Automating Historical Source Transcription
topic Census
Transcription
Machine learning
Population register
url https://openjournals.nl/index.php/hlcs/article/view/9568