An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation

Bibliographic Details
Main Authors:	Yi Qin, Feifan Yu
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Sensors
Subjects:	dialect speech recognition; coal mining industry; end to end; Conformer model; Transformer model; Connectionist Temporal Classification (CTC)
Online Access:	https://www.mdpi.com/1424-8220/25/2/341

collection	DOAJ
description	The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results.
id	doaj-art-22feb00802b94886a2e9eaafe5fa251a
institution	Kabale University
issn	1424-8220
doi	10.3390/s25020341
volume	25
issue	2
article_number	341
affiliations	Yi Qin: College of Computer Science & Technology, Xi’an University of Science and Technology, Xi’an 710054, China; Feifan Yu: SHCCIG Yubei Coal Industry Co., Ltd., Xi’an 710900, China
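
The description in this record reports results as reductions in word error rate (WER). As a reading aid, the sketch below shows how WER is conventionally computed: the token-level edit distance between a reference transcript and a hypothesis, divided by the reference length. It is not the authors' evaluation code. The function names `edit_distance` and `word_error_rate` are hypothetical, and the record does not say how the authors tokenized their transcripts, so the inputs here are assumed to be pre-tokenized lists.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences.

    Counts the minimum number of substitutions, deletions, and
    insertions needed to turn `ref` into `hyp`.
    """
    # prev[j] holds the distance between the first i-1 reference
    # tokens and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative tokens only; these are not from the authors' corpus.
    reference = "a b c d".split()
    hypothesis = "a x c".split()
    print(word_error_rate(reference, hypothesis))
```

In the usage example, one substitution and one deletion against a four-token reference give a WER of 0.5. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.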
