Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model

Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l'Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.

Bibliographic Details
Main Authors: Keerthana Murugaraj, Salima Lamsiyah, Christoph Schommer
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10838535/
author Keerthana Murugaraj
Salima Lamsiyah
Christoph Schommer
collection DOAJ
description Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l'Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.
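The abstract describes a fine-tuning schedule that assigns separate optimizers to the pre-trained encoder and the randomly initialized decoder. A minimal sketch of how such a split is commonly realized — a warmup-then-inverse-square-root learning-rate schedule with a smaller peak and longer warmup for the pre-trained component — follows; the schedule form and every hyperparameter value here are illustrative assumptions, not taken from the paper.

```python
def noam_lr(step: int, peak_lr: float, warmup: int) -> float:
    """Linear warmup followed by inverse-square-root decay.

    During warmup (step < warmup) the rate rises linearly; afterwards
    it decays as step**-0.5. The two branches meet at step == warmup.
    """
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)


def encoder_lr(step: int) -> float:
    # Assumed settings: gentle peak and long warmup so the pre-trained
    # encoder weights are not disturbed too aggressively.
    return noam_lr(step, peak_lr=2e-3, warmup=20000)


def decoder_lr(step: int) -> float:
    # Assumed settings: larger peak and shorter warmup, since the
    # randomly initialized decoder must learn from scratch.
    return noam_lr(step, peak_lr=0.1, warmup=10000)
```

In a framework such as PyTorch, this split would typically be implemented as two optimizer instances, one over the encoder's parameters and one over the decoder's, each stepped with its own schedule.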
format Article
id doaj-art-0ca324fb59af43dd8daa6b9079617407
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-0ca324fb59af43dd8daa6b9079617407, 2025-01-21T00:01:12Z, eng, IEEE, IEEE Access, ISSN 2169-3536, published 2025-01-01, vol. 13, pp. 10918-10932, doi:10.1109/ACCESS.2025.3528733, IEEE article 10838535
Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model
Keerthana Murugaraj (https://orcid.org/0009-0008-5100-055X), Salima Lamsiyah (https://orcid.org/0000-0001-8789-5713), Christoph Schommer (https://orcid.org/0000-0002-0308-7637); all authors: Department of Computer Science, Faculty of Science, Technology and Medicine, University of Luxembourg, Luxembourg City, Luxembourg
Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l'Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.
https://ieeexplore.ieee.org/document/10838535/
Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning
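The abstract evaluates summaries with ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. As a reference for what these metrics measure, here is a minimal pure-Python sketch of ROUGE-N F1 (clipped n-gram overlap between a candidate summary and a reference); published results would normally use an established implementation such as the rouge-score package, and this sketch omits details like stemming.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N F1: harmonic mean of clipped n-gram precision and recall."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # counts clipped by the minimum
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, with reference "the treaty was signed in rome" and candidate "the treaty signed in rome", all five candidate unigrams appear in the reference, giving precision 1.0, recall 5/6, and ROUGE-1 F1 of 10/11.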
title Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model
topic Historical text summarization
abstractive approach
pre-trained HistBERT encoder
large language models
transfer learning
url https://ieeexplore.ieee.org/document/10838535/