Low-Resource Active Learning of Morphological Segmentation
Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for ma...
Main Authors: | Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, Mikko Kurimo, Sami Virpioja |
---|---|
Format: | Article |
Language: | English |
Published: | Linköping University Electronic Press, 2016-03-01 |
Series: | Northern European Journal of Language Technology |
Online Access: | https://nejlt.ep.liu.se/article/view/1662 |
_version_ | 1832590632540962816 |
---|---|
author | Stig-Arne Grönroos Katri Hiovain Peter Smit Ilona Rauhala Kristiina Jokinen Mikko Kurimo Sami Virpioja |
author_facet | Stig-Arne Grönroos Katri Hiovain Peter Smit Ilona Rauhala Kristiina Jokinen Mikko Kurimo Sami Virpioja |
author_sort | Stig-Arne Grönroos |
collection | DOAJ |
description |
Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
|
format | Article |
id | doaj-art-c228429cf48a4c35ac975b3458310e1d |
institution | Kabale University |
issn | 2000-1533 |
language | English |
publishDate | 2016-03-01 |
publisher | Linköping University Electronic Press |
record_format | Article |
series | Northern European Journal of Language Technology |
spelling | doaj-art-c228429cf48a4c35ac975b3458310e1d; 2025-01-23T10:36:33Z; eng; Linköping University Electronic Press; Northern European Journal of Language Technology; 2000-1533; 2016-03-01; 4; 10.3384/nejlt.2000-1533.1644; Low-Resource Active Learning of Morphological Segmentation; Stig-Arne Grönroos (0), Katri Hiovain (1), Peter Smit (2), Ilona Rauhala (3), Kristiina Jokinen (4), Mikko Kurimo (5), Sami Virpioja (6); (0) Department of Signal Processing and Acoustics, Aalto University, Finland; (1) Institute of Behavioural Sciences, University of Helsinki, Finland; (2) Department of Signal Processing and Acoustics, Aalto University, Finland; (3) Institute of Behavioural Sciences, University of Helsinki, Finland; (4) Institute of Behavioural Sciences, University of Helsinki, Finland; (5) Department of Signal Processing and Acoustics, Aalto University, Finland; (6) Department of Computer Science, Aalto University, Finland; Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word initial and final substrings. For North Sámi we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.; https://nejlt.ep.liu.se/article/view/1662 |
title | Low-Resource Active Learning of Morphological Segmentation |
url | https://nejlt.ep.liu.se/article/view/1662 |
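
The description above names a coverage-based query strategy on word-initial and word-final substrings as the best-performing selection method, but this record does not spell out the procedure. The Python below is only a rough sketch of one way such a strategy could work, not the authors' implementation: it assumes a greedy scheme in which each candidate word is scored by how many not-yet-covered initial and final substrings it would add. The function names `edge_substrings` and `select_by_coverage`, the `max_len` cutoff, and the alphabetical tie-breaking are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def edge_substrings(word: str, max_len: int = 4) -> Set[Tuple[str, str]]:
    """Collect word-initial and word-final substrings of up to max_len characters.

    max_len is an illustrative assumption, not a value taken from the article.
    """
    feats: Set[Tuple[str, str]] = set()
    for n in range(1, min(max_len, len(word)) + 1):
        feats.add(("initial", word[:n]))
        feats.add(("final", word[-n:]))
    return feats


def select_by_coverage(words: Iterable[str], budget: int, max_len: int = 4) -> List[str]:
    """Greedily pick up to `budget` words, each maximizing newly covered substrings.

    Ties are broken alphabetically because the pool is sorted and max()
    returns the first maximal element.
    """
    pool = sorted(set(words))
    covered: Set[Tuple[str, str]] = set()
    selected: List[str] = []
    while pool and len(selected) < budget:
        # Score each remaining word by the number of substrings not yet covered.
        best = max(pool, key=lambda w: len(edge_substrings(w, max_len) - covered))
        selected.append(best)
        covered |= edge_substrings(best, max_len)
        pool.remove(best)
    return selected


if __name__ == "__main__":
    # Toy Finnish word list; the selected words would then go to an annotator.
    print(select_by_coverage(["talo", "talossa", "talon", "kissa"], budget=3))
```

In an active learning loop, the words returned by such a function would be handed to a human annotator, and the resulting segmentations added to the annotated set used to train the semi-supervised model.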