Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model
Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects, using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l’Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization.
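The abstract's central engineering idea, pairing a pre-trained HistBERT encoder with a randomly initialized Transformer decoder and fine-tuning each part with its own optimizer, can be illustrated with a minimal sketch. The record does not include the authors' code; the checkpoint name, layer counts, and learning rates below are assumptions for illustration only, not the paper's actual configuration.

```python
import torch
from transformers import BertModel

# Placeholder checkpoint name; substitute the actual HistBERT weights.
ENCODER_NAME = "bert-base-uncased"

class HistBertSumAbsSketch(torch.nn.Module):
    """Sketch of the abstract's design: pre-trained BERT-style encoder,
    freshly initialized Transformer decoder (positional encodings and
    other details omitted for brevity)."""

    def __init__(self, encoder_name: str = ENCODER_NAME, decoder_layers: int = 6):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)  # pre-trained weights
        d_model = self.encoder.config.hidden_size
        layer = torch.nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = torch.nn.TransformerDecoder(layer, num_layers=decoder_layers)  # random init
        self.tgt_embed = torch.nn.Embedding(self.encoder.config.vocab_size, d_model)
        self.lm_head = torch.nn.Linear(d_model, self.encoder.config.vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the source document once; decode the summary with a causal mask.
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        causal = torch.nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.decoder(
            self.tgt_embed(tgt_ids), memory,
            tgt_mask=causal,
            memory_key_padding_mask=~src_mask.bool(),  # True marks padding to ignore
        )
        return self.lm_head(hidden)

model = HistBertSumAbsSketch()
# Two optimizers, as the abstract describes: a small learning rate for the
# already-trained encoder and a larger one for the untrained decoder, so the
# decoder can catch up without wrecking the encoder's pre-trained weights.
encoder_opt = torch.optim.AdamW(model.encoder.parameters(), lr=2e-5)
decoder_params = [p for n, p in model.named_parameters() if not n.startswith("encoder.")]
decoder_opt = torch.optim.AdamW(decoder_params, lr=1e-3)
```

A matching evaluation sketch for the reported ROUGE-1/2/L F1 metrics, using the `rouge_score` package (the example summaries are invented placeholders):

```python
from rouge_score import rouge_scorer

reference_summary = "the treaty was signed in 1950"
generated_summary = "the treaty was signed in the year 1950"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)  # target first, prediction second
print({name: s.fmeasure for name, s in scores.items()})      # F1 for each ROUGE variant
```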
Main Authors: | Keerthana Murugaraj, Salima Lamsiyah, Christoph Schommer |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
Online Access: | https://ieeexplore.ieee.org/document/10838535/ |
_version_ | 1832592905382920192 |
---|---|
author | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer |
author_facet | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer |
author_sort | Keerthana Murugaraj |
collection | DOAJ |
description | Automatic Text Summarization (ATS) systems aim to generate concise summaries of documents while preserving their essential aspects, using either extractive or abstractive approaches. Transformer-based ATS methods have achieved success in various domains; however, there is a lack of research in the historical domain. In this paper, we introduce HistBERTSum-Abs, a novel method for abstractive historical single-document summarization. A major challenge in this task is the lack of annotated datasets for historical text summarization. To address this issue, we create a new dataset using archived documents obtained from the Centre Virtuel de la Connaissance sur l’Europe group at the University of Luxembourg. Furthermore, we leverage the potential of HistBERT, a domain-specific bidirectional language model trained on the balanced Corpus of Historical American English (https://www.english-corpora.org/coha/), to capture the semantics of the input documents. Specifically, our method adopts an encoder-decoder architecture, combining the pre-trained HistBERT encoder with a randomly initialized Transformer decoder. To address the mismatch between the pre-trained encoder and the non-pre-trained decoder, we employ a novel fine-tuning schedule that uses different optimizers for each component. Experimental results on our constructed dataset demonstrate that our HistBERTSum-Abs method outperforms recent state-of-the-art deep learning-based methods and achieves results comparable to state-of-the-art LLMs in zero-shot settings in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the first work on abstractive historical text summarization. |
format | Article |
id | doaj-art-0ca324fb59af43dd8daa6b9079617407 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Record ID: doaj-art-0ca324fb59af43dd8daa6b9079617407; Indexed: 2025-01-21T00:01:12Z; Language: eng; Publisher: IEEE; Series: IEEE Access; ISSN: 2169-3536; Publication Date: 2025-01-01; Volume: 13; Pages: 10918-10932; DOI: 10.1109/ACCESS.2025.3528733; IEEE Article No.: 10838535; Title: Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model; Authors: Keerthana Murugaraj (https://orcid.org/0009-0008-5100-055X), Salima Lamsiyah (https://orcid.org/0000-0001-8789-5713), Christoph Schommer (https://orcid.org/0000-0002-0308-7637), all affiliated with the Department of Computer Science, Faculty of Science, Technology and Medicine, University of Luxembourg, Luxembourg City, Luxembourg; Abstract: as given in the description field above; Online Access: https://ieeexplore.ieee.org/document/10838535/; Keywords: Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
spellingShingle | Keerthana Murugaraj; Salima Lamsiyah; Christoph Schommer; Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model; IEEE Access; Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
title | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_full | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_fullStr | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_full_unstemmed | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_short | Abstractive Summarization of Historical Documents: A New Dataset and Novel Method Using a Domain-Specific Pretrained Model |
title_sort | abstractive summarization of historical documents a new dataset and novel method using a domain specific pretrained model |
topic | Historical text summarization; abstractive approach; pre-trained HistBERT encoder; large language models; transfer learning |
url | https://ieeexplore.ieee.org/document/10838535/ |
work_keys_str_mv | AT keerthanamurugaraj abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel AT salimalamsiyah abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel AT christophschommer abstractivesummarizationofhistoricaldocumentsanewdatasetandnovelmethodusingadomainspecificpretrainedmodel |