Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model
Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects, using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l’Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.
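The abstract's central engineering idea, pairing a pre-trained HistBERT encoder with a randomly initialized Transformer decoder and fine-tuning each part with its own optimizer, can be illustrated with a minimal sketch. The record does not include the authors' code; the checkpoint name, layer counts, and learning rates below are assumptions for illustration only, not the paper's actual configuration.

```python
import torch
from transformers import BertModel

# Placeholder checkpoint name; substitute the actual HistBERT weights.
ENCODER_NAME = "bert-base-uncased"

class HistBertSumAbsSketch(torch.nn.Module):
    """Sketch of the abstract's design: pre-trained BERT-style encoder,
    freshly initialized Transformer decoder (positional encodings and
    other details omitted for brevity)."""

    def __init__(self, encoder_name: str = ENCODER_NAME, decoder_layers: int = 6):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)  # pre-trained weights
        d_model = self.encoder.config.hidden_size
        layer = torch.nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, num_layers=decoder_layers)  # random init
        self.tgt_embed = torch.nn.Embedding(self.encoder.config.vocab_size, d_model)
        self.lm_head = torch.nn.Linear(d_model, self.encoder.config.vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the source document once; decode the summary with a causal mask.
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        causal = torch.nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.decoder(
            self.tgt_embed(tgt_ids), memory,
            tgt_mask=causal,
            memory_key_padding_mask=~src_mask.bool(),  # True marks padding to ignore
        )
        return self.lm_head(hidden)

model = HistBertSumAbsSketch()
# Two optimizers, as the abstract describes: a small learning rate for the
# already-trained encoder and a larger one for the untrained decoder, so the
# decoder can catch up without wrecking the encoder's pre-trained weights.
encoder_opt = torch.optim.AdamW(model.encoder.parameters(), lr=2e-5)
decoder_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]
decoder_opt = torch.optim.AdamW(decoder_params, lr=1e-3)
```

A matching evaluation sketch for the reported ROUGE-1/2/L F1 metrics, using the `rouge_score` package (the example summaries are invented placeholders):

```python
from rouge_score import rouge_scorer

reference_summary = "the treaty was signed in 1950"
generated_summary = "the treaty was signed in the year 1950"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)  # target first, prediction second
print({name: s.fmeasure for name, s in scores.items()})      # F1 for each ROUGE variant
```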
Main Authors: | Keerthana Murugaraj, Salima Lamsiyah, Christoph Schommer |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
Online Access: | https://ieeexplore.ieee.org/document/10838535/ |
_version_ | 1832592905382920192 |
---|---|
author | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer |
author_facet | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer |
author_sort | Keerthana Murugaraj |
collection | DOAJ |
description | Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects, using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l’Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization. |
format | Article |
id | doaj-art-0ca324fb59af43dd8daa6b9079617407 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Record ID: doaj-art-0ca324fb59af43dd8daa6b9079617407; Indexed: 2025-01-21T00:01:12Z; Language: eng; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536; Publication Date: 2025-01-01; Volume: 13; Pages: 10918-10932; DOI: 10.1109/ACCESS.2025.3528733; IEEE Article No.: 10838535; Title: Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model; Authors: Keerthana Murugaraj (https://orcid.org/0009-0008-5100-055X), Salima Lamsiyah (https://orcid.org/0000-0001-8789-5713), Christoph Schommer (https://orcid.org/0000-0002-0308-7637), all affiliated with the Department of Computer Science, Faculty of Science, Technology and Medicine, University of Luxembourg, Luxembourg City, Luxembourg; Abstract: as given in the description field above; Online Access: https://ieeexplore.ieee.org/document/10838535/; Keywords: Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
spellingShingle | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer; Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model; IEEE Access; Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
title | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_full | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_fullStr | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_full_unstemmed | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_short | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_sort | abstractive summarization of historical documents a new dataset and novel method using a domain specific pretrained model |
topic | Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
url | https://ieeexplore.ieee.org/document/10838535/ |
work_keys_str_mv | AT keerthanamurugaraj abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel AT salimalamsiyah abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel AT christophschommer abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel |