Low-Resource Active Learning of Morphological Segmentation

Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for ma...

Bibliographic Details
Main Authors:	Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, Sami Virpioja
Format:	Article
Language:	English
Published:	Linköping University Electronic Press 2016-03-01
Series:	Northern European Journal of Language Technology
Online Access:	https://nejlt.ep.liu.se/article/view/1662

_version_	1832590632540962816
author	Stig-Arne Grönroos Katri Hiovain Peter Smit Ilona Rauhala Kristiina Jokinen Mikko Kurimo Sami Virpioja
collection	DOAJ
description	Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
format	Article
id	doaj-art-c228429cf48a4c35ac975b3458310e1d
institution	Kabale University
issn	2000-1533
language	English
publishDate	2016-03-01
publisher	Linköping University Electronic Press
record_format	Article
series	Northern European Journal of Language Technology
spelling	doaj-art-c228429cf48a4c35ac975b3458310e1d | 2025-01-23T10:36:33Z | eng | Linköping University Electronic Press | Northern European Journal of Language Technology | 2000-1533 | 2016-03-01 | 4 | 10.3384/nejlt.2000-1533.1644 | Low-Resource Active Learning of Morphological Segmentation | Stig-Arne Grönroos (Department of Signal Processing and Acoustics, Aalto University, Finland) | Katri Hiovain (Institute of Behavioural Sciences, University of Helsinki, Finland) | Peter Smit (Department of Signal Processing and Acoustics, Aalto University, Finland) | Ilona Rauhala (Institute of Behavioural Sciences, University of Helsinki, Finland) | Kristiina Jokinen (Institute of Behavioural Sciences, University of Helsinki, Finland) | Mikko Kurimo (Department of Signal Processing and Acoustics, Aalto University, Finland) | Sami Virpioja (Department of Computer Science, Aalto University, Finland) | https://nejlt.ep.liu.se/article/view/1662
title	Low-Resource Active Learning of Morphological Segmentation
url	https://nejlt.ep.liu.se/article/view/1662

Similar Items