Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data)

This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a...

Bibliographic Details
Main Authors: Majdi Sawalha, Faisal Al-Shargi, Sane Yagi, Abdallah T. AlShdaifat, Bassam Hammo, Mariam Belajeed, Lubna R. Al-Ogaili
Format: Article
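Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: Syntactic annotation; morphological annotation; syntactic relations; semantic relations; i'rab إعراب (ʾi‘rāb); analysis
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924011739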
author Majdi Sawalha
Faisal Al-Shargi
Sane Yagi
Abdallah T. AlShdaifat
Bassam Hammo
Mariam Belajeed
Lubna R. Al-Ogaili
author_sort Majdi Sawalha
collection DOAJ
description This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a cornerstone of classical Arabic, presents unique challenges for NLP due to its sacred nature and complex linguistic features. MASAQ provides a detailed syntactic and morphological annotation of the entire Quranic text, utilizing a rigorously verified text from Tanzil.net. The dataset includes more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. The annotation process involved a team of expert Arabic linguists who employed traditional i'rab methodologies to ensure high accuracy and consistency. The dataset is structured in multiple formats (tab-separated text file (tsv), SQLite3 database (.db), comma-separated file (csv), and JavaScript Object Notation (.JSON)) to cater to various research needs. MASAQ's unique features include a comprehensive tagset of 72 syntactic roles, detailed morphological analysis, and context-specific annotations. This dataset is particularly valuable for tasks such as dependency parsing, grammar checking, machine translation, and text summarization. The potential applications of MASAQ are vast, ranging from pedagogical uses in teaching Arabic grammar to developing sophisticated NLP tools. By providing a high-quality, syntactically annotated dataset, MASAQ aims to advance the field of Arabic NLP, enabling more accurate and more efficient language processing tools. The dataset is made available under the Creative Commons Attribution 3.0 License, which governs its use and distribution. It has been created in compliance with ethical guidelines and with respect for the integrity of the Quranic text.
format Article
id doaj-art-6cdd1d24967145d8a41441bc79aab34f
institution Kabale University
issn 2352-3409
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-6cdd1d24967145d8a41441bc79aab34f
2025-01-31T05:11:31Z
eng
Elsevier
Data in Brief
2352-3409
2025-02-01
Volume 58, article 111211
Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data)
Majdi Sawalha (College of Engineering, Al-Ain University, Abu Dhabi, UAE; King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan; corresponding author)
Faisal Al-Shargi (Amazon Robotics, New York, USA)
Sane Yagi (Department of Foreign Languages, University of Sharjah, Sharjah, UAE; English Department, The University of Jordan, Amman, Jordan)
Abdallah T. AlShdaifat (College of Arts and Languages, Mohamed bin Zayed University for Humanities, Abu Dhabi, UAE)
Bassam Hammo (King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan; School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan)
Mariam Belajeed (Arabic Department, University of Sharjah, UAE)
Lubna R. Al-Ogaili (Arabic Department, University of Sharjah, UAE)
title Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data)
topic Syntactic annotation
morphological annotation
syntactic relations
semantic relations
i'rab إعراب (ʾi‘rāb)
analysis
url http://www.sciencedirect.com/science/article/pii/S2352340924011739