Stemming of Slovenian library science texts

Bibliographic Details
Main Authors: Polona Vilar, Jasna Maver
Format: Article
Language: English
Published: Slovenian Library Association & University of Ljubljana Press (Založba Univerze v Ljubljani) 2002-01-01
Series: Knjižnica
Online Access: https://journals.uni-lj.si/knjiznica/article/view/14008
Description
Summary: The article describes the preparation of a stemming algorithm for Slovenian library science texts. The procedure consisted of three phases: learning, testing and evaluation. The article presents the preparation of an optimal stemmer for Slovenian texts from the field of library science, its testing, and its comparison with two other stemmers for the Slovenian language: the Popovič stemmer and the Generic stemmer. A corpus of 790,000 words from the field of library science was used for learning. Lists of stems, word endings and stop-words were built. In the testing phase, the component parts of the algorithm were tested on an additional corpus of 167,000 words. In the evaluation phase, the three stemmers were compared on the same word corpus. The results of each stemmer were compared with a manually prepared reference stemming of the corpus, which consisted of error-free groups of semantically connected words. Particular attention was paid to understemming, measured as the number of stems an algorithm produced for semantically connected words. The results were analysed statistically with the Kruskal-Wallis test. The Optimal stemmer produced the best results: it matched the reference results most closely and also gave the smallest number of stems per semantic meaning. The Popovič stemmer followed closely, and the Generic stemmer proved to be the least accurate. The procedures described in the thesis can serve as a platform for developing tools for automatic indexing and retrieval of library science texts in the Slovenian language.
ISSN: 0023-2424, 1581-7903
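
The understemming measure mentioned in the summary (the number of stems an algorithm produces for one group of semantically connected words) can be illustrated with a minimal sketch. This is not the article's Optimal stemmer: the ending list, stop-word list, minimum stem length, and the function names `stem` and `stems_per_group` are all hypothetical, chosen only to show a longest-suffix-stripping approach and the counting step.

```python
# Hypothetical illustration data: NOT the lists built in the article.
# Endings are tried longest first so that "ami" wins over "i".
ENDINGS = sorted(["ami", "ih", "em", "om", "a", "e", "i", "o", "u"],
                 key=len, reverse=True)
STOP_WORDS = {"in", "je", "na", "za"}
MIN_STEM_LEN = 3  # assumed guard against stripping very short words


def stem(word):
    """Strip the longest matching ending, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LEN:
            return word[: -len(ending)]
    return word


def stems_per_group(groups):
    """Count distinct stems per group of semantically connected words.

    A count of 1 means every word in the group was conflated to one stem;
    a higher count indicates understemming for that group.
    """
    return {
        label: len({stem(w) for w in words if w.lower() not in STOP_WORDS})
        for label, words in groups.items()
    }


# Manually grouped reference words (illustrative only).
reference_groups = {
    "library": ["knjižnica", "knjižnice", "knjižnici", "knjižnicami"],
    "person": ["človek", "človeka", "ljudje"],
}

print(stems_per_group(reference_groups))
```

In this sketch the inflected forms of "knjižnica" collapse to a single stem, while the suppletive plural "ljudje" keeps a separate stem from "človek", so the second group is counted as understemmed. Per-group counts of this kind, collected for each stemmer, are the sort of samples that a Kruskal-Wallis test could then compare.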