Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data)
This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a...
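Main Authors: | Majdi Sawalha, Faisal Al-Shargi, Sane Yagi, Abdallah T. AlShdaifat, Bassam Hammo, Mariam Belajeed, Lubna R. Al-Ogaili |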
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-02-01 |
Series: | Data in Brief |
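Subjects: | Syntactic annotation; morphological annotation; syntactic relations; semantic relations; i'rab إعراب (ʾi‘rāb) analysis |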
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340924011739 |
_version_ | 1832576478632476672 |
---|---|
author | Majdi Sawalha Faisal Al-Shargi Sane Yagi Abdallah T. AlShdaifat Bassam Hammo Mariam Belajeed Lubna R. Al-Ogaili |
author_facet | Majdi Sawalha Faisal Al-Shargi Sane Yagi Abdallah T. AlShdaifat Bassam Hammo Mariam Belajeed Lubna R. Al-Ogaili |
author_sort | Majdi Sawalha |
collection | DOAJ |
description | This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a cornerstone of classical Arabic, presents unique challenges for NLP due to its sacred nature and complex linguistic features. MASAQ provides a detailed syntactic and morphological annotation of the entire Quranic text, utilizing a rigorously verified text from Tanzil.net. The dataset includes more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. The annotation process involved a team of expert Arabic linguists who employed traditional i'rab methodologies to ensure high accuracy and consistency. The dataset is structured in multiple formats (tab-separated text file (tsv), SQLite3 database (.db), comma-separated file (csv), and JavaScript Object Notation (.JSON)) to cater to various research needs. MASAQ's unique features include a comprehensive tagset of 72 syntactic roles, detailed morphological analysis, and context-specific annotations. This dataset is particularly valuable for tasks such as dependency parsing, grammar checking, machine translation, and text summarization. The potential applications of MASAQ are vast, ranging from pedagogical uses in teaching Arabic grammar to developing sophisticated NLP tools. By providing a high-quality, syntactically annotated dataset, MASAQ aims to advance the field of Arabic NLP, enabling more accurate and more efficient language processing tools. The dataset is made available under the Creative Commons Attribution 3.0 License, which governs its use and distribution. It has been created in compliance with ethical guidelines and with respect for the integrity of the Quranic text. |
format | Article |
id | doaj-art-6cdd1d24967145d8a41441bc79aab34f |
institution | Kabale University |
issn | 2352-3409 |
language | English |
publishDate | 2025-02-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj-art-6cdd1d24967145d8a41441bc79aab34f · 2025-01-31T05:11:31Z · eng · Elsevier · Data in Brief · 2352-3409 · 2025-02-01 · 58 · 111211 · Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) · Majdi Sawalha [0], Faisal Al-Shargi [1], Sane Yagi [2], Abdallah T. AlShdaifat [3], Bassam Hammo [4], Mariam Belajeed [5], Lubna R. Al-Ogaili [6] · [0] College of Engineering, Al-Ain University, Abu Dhabi, UAE; King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan; Corresponding author. [1] Amazon Robotics, New York, USA. [2] Department of Foreign Languages, University of Sharjah, Sharjah, UAE; English Department, The University of Jordan, Amman, Jordan. [3] College of Arts and Languages, Mohamed bin Zayed University for Humanities, Abu Dhabi, UAE. [4] King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan; School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan. [5] Arabic Department, University of Sharjah, UAE. [6] Arabic Department, University of Sharjah, UAE. · This paper introduces the Morphologically-Analyzed and Syntactically-Annotated Quran (MASAQ) dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a cornerstone of classical Arabic, presents unique challenges for NLP due to its sacred nature and complex linguistic features. MASAQ provides a detailed syntactic and morphological annotation of the entire Quranic text, utilizing a rigorously verified text from Tanzil.net. The dataset includes more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. The annotation process involved a team of expert Arabic linguists who employed traditional i'rab methodologies to ensure high accuracy and consistency. The dataset is structured in multiple formats (tab-separated text file (tsv), SQLite3 database (.db), comma-separated file (csv), and JavaScript Object Notation (.JSON)) to cater to various research needs. MASAQ's unique features include a comprehensive tagset of 72 syntactic roles, detailed morphological analysis, and context-specific annotations. This dataset is particularly valuable for tasks such as dependency parsing, grammar checking, machine translation, and text summarization. The potential applications of MASAQ are vast, ranging from pedagogical uses in teaching Arabic grammar to developing sophisticated NLP tools. By providing a high-quality, syntactically annotated dataset, MASAQ aims to advance the field of Arabic NLP, enabling more accurate and more efficient language processing tools. The dataset is made available under the Creative Commons Attribution 3.0 License, which governs its use and distribution. It has been created in compliance with ethical guidelines and with respect for the integrity of the Quranic text. · http://www.sciencedirect.com/science/article/pii/S2352340924011739 · Syntactic annotation; morphological annotation; syntactic relations; semantic relations; i'rab إعراب (ʾi‘rāb) analysis |
spellingShingle | Majdi Sawalha Faisal Al-Shargi Sane Yagi Abdallah T. AlShdaifat Bassam Hammo Mariam Belajeed Lubna R. Al-Ogaili Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) Data in Brief Syntactic annotation morphological annotation syntactic relations semantic relations i'rab إعراب (ʾi‘rāb) analysis |
title | Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) |
title_full | Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) |
title_fullStr | Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) |
title_full_unstemmed | Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) |
title_short | Morphologically-analyzed and syntactically-annotated Quran dataset (Mendeley Data) |
title_sort | morphologically analyzed and syntactically annotated quran datasetmendeley data |
topic | Syntactic annotation morphological annotation syntactic relations semantic relations i'rab إعراب (ʾi‘rāb) analysis |
url | http://www.sciencedirect.com/science/article/pii/S2352340924011739 |
work_keys_str_mv | AT majdisawalha morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT faisalalshargi morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT saneyagi morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT abdallahtalshdaifat morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT bassamhammo morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT mariambelajeed morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata AT lubnaralogaili morphologicallyanalyzedandsyntacticallyannotatedqurandatasetmendeleydata |
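
The description states that MASAQ is distributed as TSV, CSV, SQLite3 and JSON files and that it uses a tagset of 72 syntactic roles. As a rough illustration of how the tab-separated release might be explored, the Python sketch below tallies syntactic-role tags from a TSV stream. The column names (`sura`, `verse`, `word_no`, `syntactic_role`) and the sample rows are assumptions made for this example only; they are placeholders, not the dataset's actual header or content, so check the real file before reusing the code.

```python
import csv
import io
from collections import Counter

# Hypothetical column name; the real MASAQ header may differ.
ROLE_COLUMN = "syntactic_role"


def count_roles(tsv_file, role_column=ROLE_COLUMN):
    """Count how often each syntactic-role tag occurs in a TSV stream."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # Skip rows where the role cell is missing or empty.
    return Counter(row[role_column] for row in reader if row.get(role_column))


# Placeholder rows for demonstration; these are NOT real MASAQ annotations.
SAMPLE_TSV = (
    "sura\tverse\tword_no\tsyntactic_role\n"
    "1\t1\t1\tROLE_A\n"
    "1\t1\t2\tROLE_B\n"
    "1\t2\t1\tROLE_A\n"
)

if __name__ == "__main__":
    counts = count_roles(io.StringIO(SAMPLE_TSV))
    for role, n in counts.most_common():
        print(role, n)
```

To run the same count on a downloaded file, pass a file object opened with `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` sample, and set `role_column` to whatever the real header uses.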