Low-Resource Active Learning of Morphological Segmentation

Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for ma...

Bibliographic Details
Main Authors: Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, Sami Virpioja
Format: Article
Language: English
Published: Linköping University Electronic Press 2016-03-01
Series: Northern European Journal of Language Technology
Online Access: https://nejlt.ep.liu.se/article/view/1662
author Stig-Arne Grönroos
Katri Hiovain
Peter Smit
Ilona Rauhala
Kristiina Jokinen
Mikko Kurimo
Sami Virpioja
collection DOAJ
description Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
format Article
id doaj-art-c228429cf48a4c35ac975b3458310e1d
institution Kabale University
issn 2000-1533
language English
publishDate 2016-03-01
publisher Linköping University Electronic Press
record_format Article
series Northern European Journal of Language Technology
spelling doaj-art-c228429cf48a4c35ac975b3458310e1d
2025-01-23T10:36:33Z
eng
Linköping University Electronic Press
Northern European Journal of Language Technology
2000-1533
2016-03-01
4
10.3384/nejlt.2000-1533.1644
Low-Resource Active Learning of Morphological Segmentation
Stig-Arne Grönroos (Department of Signal Processing and Acoustics, Aalto University, Finland)
Katri Hiovain (Institute of Behavioural Sciences, University of Helsinki, Finland)
Peter Smit (Department of Signal Processing and Acoustics, Aalto University, Finland)
Ilona Rauhala (Institute of Behavioural Sciences, University of Helsinki, Finland)
Kristiina Jokinen (Institute of Behavioural Sciences, University of Helsinki, Finland)
Mikko Kurimo (Department of Signal Processing and Acoustics, Aalto University, Finland)
Sami Virpioja (Department of Computer Science, Aalto University, Finland)
https://nejlt.ep.liu.se/article/view/1662
title Low-Resource Active Learning of Morphological Segmentation
url https://nejlt.ep.liu.se/article/view/1662