Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants
We present a novel n-gram based string matching technique, which we call the targeted s-gram matching technique. In the technique, n-grams are classified into categories on the basis of character contiguity in words. The categories are then utilized in matching. The technique was compared with the c...
Main Authors: | Ari Pirkola, Heikki Keskustalo, Erkka Leppänen, Antti-Pekka Känsälä, Kalervo Järvelin |
---|---|
Format: | Article |
Language: | English |
Published: | University of Borås, 2002-01-01 |
Series: | Information Research: An International Electronic Journal |
Online Access: | http://informationr.net/ir/7-2/paper126.html |
_version_ | 1832570021073649664 |
---|---|
author | Ari Pirkola; Heikki Keskustalo; Erkka Leppänen; Antti-Pekka Känsälä; Kalervo Järvelin |
author_sort | Ari Pirkola |
collection | DOAJ |
description | We present a novel n-gram based string matching technique, which we call the targeted s-gram matching technique. In this technique, n-grams are classified into categories on the basis of character contiguity in words, and the categories are then used in matching. The technique was compared with the conventional n-gram technique, which uses adjacent characters as n-grams. Several types of words and word pairs were studied. English, German, and Swedish query keys were matched against their Finnish spelling variants and Finnish morphological variants using a target word list of 119 000 Finnish words. In all cross-lingual tests, the targeted s-gram matching technique outperformed the conventional n-gram matching technique. The technique was also highly effective for monolingual word form variants. The effects of query key length and of the length of the longest common subsequence (LCS) of the variants on the performance of s-grams were analyzed. |
format | Article |
id | doaj-art-a8b2685a6b2247878a5fa6b7ce332b39 |
institution | Kabale University |
issn | 1368-1613 |
language | English |
publishDate | 2002-01-01 |
publisher | University of Borås |
record_format | Article |
series | Information Research: An International Electronic Journal |
title | Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants |
url | http://informationr.net/ir/7-2/paper126.html |
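
The description above says that n-grams are classified by character contiguity and that the categories are then used in matching. The sketch below shows one plausible reading of that idea, not the authors' implementation. It forms character pairs (digrams) with a given number of skipped characters, groups the skip distances into classes, and compares two words class by class. The function names, the default classes `((0,), (1, 2))`, and the shared-over-union score are all assumptions made for illustration; the record itself does not specify them.

```python
def skip_digrams(word, skip):
    """Return the set of character pairs in `word` that have exactly
    `skip` characters between them (skip=0 gives conventional digrams)."""
    step = skip + 1
    return {word[i] + word[i + step] for i in range(len(word) - step)}


def class_grams(word, gram_class):
    """Collect the digrams of `word` for every skip distance in one class."""
    grams = set()
    for skip in gram_class:
        grams |= skip_digrams(word, skip)
    return grams


def s_gram_similarity(a, b, gram_classes=((0,), (1, 2))):
    """Compare two words class by class.

    Assumption: grams are only compared within the same class, and the
    score is (shared grams) / (all distinct grams), summed over classes.
    """
    shared = total = 0
    for gram_class in gram_classes:
        grams_a = class_grams(a, gram_class)
        grams_b = class_grams(b, gram_class)
        shared += len(grams_a & grams_b)
        total += len(grams_a | grams_b)
    return shared / total if total else 0.0


def best_matches(query, targets, top=3):
    """Rank a (hypothetical) target word list by similarity to `query`."""
    return sorted(targets, key=lambda t: s_gram_similarity(query, t),
                  reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical target list; the study used 119 000 Finnish words.
    words = ["farmakologia", "talo", "psykologia"]
    print(best_matches("pharmacology", words))
```

Passing `gram_classes=((0,),)` reduces the score to a comparison of adjacent-character digrams only, which corresponds to the conventional n-gram baseline that the description mentions.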