A simple but effective method for Indonesian automatic text summarisation
Automatic text summarisation (ATS) is an automatic procedure for extracting critical information from a text using a specific algorithm or method; it has two main approaches, abstractive summarisation and extractive summarisation. Due to the scarcity of corpus, abstractive summarisat...
| Main Authors: | Nankai Lin, Jinxian Li, Shengyi Jiang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2022-12-01 |
| Series: | Connection Science |
| Subjects: | automatic text summarisation; lightgbm; indonesian; regression |
| Online Access: | http://dx.doi.org/10.1080/09540091.2021.1937942 |
| _version_ | 1849761429714894848 |
|---|---|
| author | Nankai Lin, Jinxian Li, Shengyi Jiang |
| collection | DOAJ |
| description | Automatic text summarisation (ATS) is an automatic procedure for extracting critical information from a text using a specific algorithm or method; it has two main approaches, abstractive summarisation and extractive summarisation. Because corpora are scarce, abstractive summarisation performs poorly on low-resource language ATS tasks, so researchers commonly apply extractive summarisation to low-resource languages instead. As an emerging branch of extraction-based summarisation, methods based on feature analysis quantify the significance of information by calculating a utility score for each sentence in the article. In this study, we propose a simple but effective extractive method based on the Light Gradient Boosting Machine (LightGBM) regression model for Indonesian documents. Four features are extracted: PositionScore, TitleScore, the semantic representation similarity between the sentence and the document title, and the semantic representation similarity between the sentence and the centre of the sentence's cluster. We define a formula for calculating the sentence score as the objective function of the linear regression. Considering the characteristics of Indonesian, we use Indonesian lemmatisation technology to improve the calculation of the sentence score. The results show that our method is more applicable. |
| format | Article |
| id | doaj-art-fb482ed456ef48a4a4bb2c8d0cdb1d20 |
| institution | DOAJ |
| issn | 0954-0091 1360-0494 |
| language | English |
| publishDate | 2022-12-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Connection Science |
| spelling | doaj-art-fb482ed456ef48a4a4bb2c8d0cdb1d20; 2025-08-20T03:06:01Z; eng; Taylor & Francis Group; Connection Science; ISSN 0954-0091, 1360-0494; 2022-12-01; vol. 34, no. 1, pp. 29-43; 10.1080/09540091.2021.1937942; 1937942; Nankai Lin, Jinxian Li, Shengyi Jiang (all: School of Computer Science and Technology, Guangdong University of Foreign Studies) |
| title | A simple but effective method for Indonesian automatic text summarisation |
| topic | automatic text summarisation; lightgbm; indonesian; regression |
| url | http://dx.doi.org/10.1080/09540091.2021.1937942 |
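
The feature-based extractive approach described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not give the formulas for PositionScore or TitleScore, so the forms below are assumptions, bag-of-words counts stand in for the semantic representations, the whole document is treated as a single cluster, and a plain average of the four features stands in for the trained LightGBM regressor. The function names (`tokens`, `cosine`, `sentence_features`, `summarise`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a real system would lemmatise Indonesian words here."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sentence_features(title, sentences):
    """Return one [position, title_overlap, sim_title, sim_centroid] row per sentence."""
    title_vec = Counter(tokens(title))
    vecs = [Counter(tokens(s)) for s in sentences]
    centroid = Counter()
    for v in vecs:
        centroid.update(v)  # summed counts; cosine similarity ignores the scale
    n = len(sentences)
    rows = []
    for i, v in enumerate(vecs):
        position = 1.0 - i / n  # assumed form: earlier sentences score higher
        # assumed form: share of title words that also occur in the sentence
        overlap = len(set(v) & set(title_vec)) / len(title_vec) if title_vec else 0.0
        rows.append([position, overlap, cosine(v, title_vec), cosine(v, centroid)])
    return rows


def summarise(title, sentences, k=1):
    """Pick the k highest-scoring sentences, returned in document order."""
    rows = sentence_features(title, sentences)
    scores = [sum(r) / len(r) for r in rows]  # placeholder for a trained regressor
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = [
        "Banjir besar melanda Jakarta pada hari Senin.",
        "Warga mengungsi ke tempat aman.",
        "Pemerintah mengirim bantuan.",
    ]
    print(summarise("Banjir melanda Jakarta", doc, k=1))
```

In a setup closer to the one the abstract describes, the rows from `sentence_features` would serve as inputs to a regression model trained to predict a sentence score, and the averaging line would be replaced by that model's prediction.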