Semantic-Based Classification of Long Texts on Higher Education in China

The development level of higher education (HE) is an important indicator of a country's development level and potential. HE-related documents mirror the development process of HE. Research on HE has been developing rapidly in China, resultin...

Bibliographic Details
Main Authors: Chun Li, Yanying Fei
Format: Article
Language: English
Published: Wiley 2021-01-01
Series: Discrete Dynamics in Nature and Society
Online Access: http://dx.doi.org/10.1155/2021/9237713
_version_ 1832561397619228672
author Chun Li
Yanying Fei
author_facet Chun Li
Yanying Fei
author_sort Chun Li
collection DOAJ
description The development level of higher education (HE) is an important indicator of a country's development level and potential. HE-related documents mirror the development process of HE. Research on HE has been developing rapidly in China, resulting in a huge number of texts, such as relevant policies, speech drafts, and yearbooks. Traditional manual classification of HE texts is inefficient and cannot cope with this volume. Moreover, direct classification performs poorly because HE texts tend to be long and form an imbalanced dataset. To solve these problems, this paper improves the convolutional neural network (CNN) to obtain the HE-CNN classification model for HE texts. First, Chinese HE policies, speech drafts, and yearbooks (1979–2020) were downloaded from the official website of the Chinese Ministry of Education. In total, 463 files were collected and divided into four classes: definition, task, method, and effect evaluation. To handle the huge number of HE texts, the Twitter-latent Dirichlet allocation (LDA) topic model was employed to extract word frequency and critical information, such as age and author, enhancing the training effect of the CNN. To address the dataset imbalance problem, CNN parameters were optimized repeatedly through comparative experiments, which further improved the training effect. Finally, the proposed HE-CNN model was found to be more effective and accurate than other classification models.
format Article
id doaj-art-f34531ff189042f7a4b302fb3b41441a
institution Kabale University
issn 1026-0226
1607-887X
language English
publishDate 2021-01-01
publisher Wiley
record_format Article
series Discrete Dynamics in Nature and Society
spelling doaj-art-f34531ff189042f7a4b302fb3b41441a 2025-02-03T01:25:10Z eng Wiley Discrete Dynamics in Nature and Society 1026-0226 1607-887X 2021-01-01 2021 10.1155/2021/9237713 9237713 Semantic-Based Classification of Long Texts on Higher Education in China
Chun Li (School of Marxism, Dalian University of Technology, Dalian 116023, China)
Yanying Fei (Faculty of Humanities and Social Sciences, Dalian University of Technology, Dalian 116086, China)
http://dx.doi.org/10.1155/2021/9237713
title Semantic-Based Classification of Long Texts on Higher Education in China
title_sort semantic based classification of long texts on higher education in china
url http://dx.doi.org/10.1155/2021/9237713
work_keys_str_mv AT chunli semanticbasedclassificationoflongtextsonhighereducationinchina
AT yanyingfei semanticbasedclassificationoflongtextsonhighereducationinchina