A simple but effective method for Indonesian automatic text summarisation

Automatic text summarisation (ATS), which comprises two main approaches, abstractive summarisation and extractive summarisation, is an automatic procedure for extracting critical information from a text using a specific algorithm or method. Due to the scarcity of corpus, abstractive summarisat...

Bibliographic Details
Main Authors: Nankai Lin, Jinxian Li, Shengyi Jiang
Format: Article
Language: English
Published: Taylor & Francis Group 2022-12-01
Series: Connection Science
Subjects:
Online Access: http://dx.doi.org/10.1080/09540091.2021.1937942
collection DOAJ
description Automatic text summarisation (ATS), which comprises two main approaches, abstractive summarisation and extractive summarisation, is an automatic procedure for extracting critical information from a text using a specific algorithm or method. Because corpora are scarce, abstractive summarisation performs poorly on ATS tasks for low-resource languages. For this reason, researchers commonly apply extractive rather than abstractive summarisation to low-resource languages. As an emerging branch of extraction-based summarisation, methods based on feature analysis quantify the significance of information by calculating a utility score for each sentence in the article. In this study, we propose a simple but effective extractive method for Indonesian documents based on the Light Gradient Boosting Machine regression model. Four features are extracted: PositionScore, TitleScore, the semantic representation similarity between the sentence and the title of the document, and the semantic representation similarity between the sentence and the centre of the sentence's cluster. We define a formula for calculating the sentence score as the objective function of the linear regression. Considering the characteristics of Indonesian, we use Indonesian lemmatisation technology to improve the calculation of the sentence score. The results show that our method is more applicable.
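The four sentence features named in the description can be illustrated with a minimal sketch. The record does not give the formulas, so everything below is an assumption made for illustration: the linear `position_score`, the overlap-based `title_score`, the bag-of-words vectors standing in for the semantic representations, and a single whole-document centroid standing in for the sentence's cluster centre. It is not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase whitespace split (no lemmatisation).
    return text.lower().split()


def position_score(index, n_sentences):
    # Assumed form: earlier sentences score higher, decreasing linearly.
    return 1.0 - index / n_sentences


def title_score(sentence_tokens, title_tokens):
    # Assumed form: fraction of distinct title words found in the sentence.
    title_set = set(title_tokens)
    if not title_set:
        return 0.0
    return len(title_set & set(sentence_tokens)) / len(title_set)


def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counter objects).
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_features(title, sentences):
    # Returns one row of four features per sentence:
    # [position, title overlap, similarity to title, similarity to centroid].
    title_tokens = tokenize(title)
    title_vec = Counter(title_tokens)
    tokenized = [tokenize(s) for s in sentences]
    # Simplification: one cluster for the whole document, so the "cluster
    # centre" is the summed word-count vector of all sentences.
    centroid = Counter()
    for tokens in tokenized:
        centroid.update(tokens)
    rows = []
    for i, tokens in enumerate(tokenized):
        vec = Counter(tokens)
        rows.append([
            position_score(i, len(sentences)),
            title_score(tokens, title_tokens),
            cosine(vec, title_vec),
            cosine(vec, centroid),
        ])
    return rows


rows = sentence_features(
    "Banjir Jakarta",
    ["Banjir melanda Jakarta", "Warga mengungsi ke sekolah"],
)
```

In a pipeline of the kind the description outlines, rows like these would be the inputs to a regression model (for example LightGBM's `LGBMRegressor`) trained to predict a per-sentence score, with the top-scoring sentences forming the summary.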
format Article
id doaj-art-fb482ed456ef48a4a4bb2c8d0cdb1d20
institution DOAJ
issn 0954-0091
1360-0494
language English
publishDate 2022-12-01
publisher Taylor & Francis Group
record_format Article
series Connection Science
spelling doaj-art-fb482ed456ef48a4a4bb2c8d0cdb1d20 (2025-08-20T03:06:01Z). Nankai Lin, Jinxian Li, Shengyi Jiang (all: School of Computer Science and Technology, Guangdong University of Foreign Studies). A simple but effective method for Indonesian automatic text summarisation. Connection Science, Taylor & Francis Group, ISSN 0954-0091, 1360-0494. 2022-12-01, vol. 34, no. 1, pp. 29-43. doi:10.1080/09540091.2021.1937942. Language: eng. Keywords: automatic text summarisation; lightgbm; indonesian; regression.
topic automatic text summarisation
lightgbm
indonesian
regression