An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation

The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model...

Bibliographic Details
Main Authors: Yi Qin, Feifan Yu
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Sensors
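Subjects: dialect speech recognition; coal mining industry; end to end; Conformer model; Transformer model; Connectionist Temporal Classification (CTC)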
Online Access: https://www.mdpi.com/1424-8220/25/2/341
collection DOAJ
description The coal mining industry in Northern Shaanxi is robust, with a prevalent use of the local dialect, known as “Shapu”, characterized by a distinct Northern Shaanxi accent. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect, leveraging the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus reflecting the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suitable for the North Shaanxi dialect, focusing on the unique pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, enhancing recognition performance. To enhance the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer architecture. Additionally, to mitigate the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. The experimental results on our self-established voice dataset for the Northern Shaanxi coal mining industry demonstrate that the proposed Conformer–Transformer–CTC model achieves a 9.2% and 10.3% reduction in the word error rate compared to the standalone Conformer and Transformer models, respectively, confirming the advancement of our method. The next step will involve researching how to improve the performance of dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects, thereby achieving better recognition results.
id doaj-art-22feb00802b94886a2e9eaafe5fa251a
institution Kabale University
issn 1424-8220
spelling doaj-art-22feb00802b94886a2e9eaafe5fa251a
2025-01-24T13:48:34Z
eng
MDPI AG
Sensors
1424-8220
2025-01-01
Volume 25, Issue 2, Article 341
10.3390/s25020341
An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
Yi Qin [0]
Feifan Yu [1]
[0] College of Computer Science & Technology, Xi’an University of Science and Technology, Xi’an 710054, China
[1] SHCCIG Yubei Coal Industry Co., Ltd., Xi’an 710900, China
https://www.mdpi.com/1424-8220/25/2/341
dialect speech recognition; coal mining industry; end to end; Conformer model; Transformer model; Connectionist Temporal Classification (CTC)
title An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
topic dialect speech recognition
coal mining industry
end to end
Conformer model
Transformer model
Connectionist Temporal Classification (CTC)
url https://www.mdpi.com/1424-8220/25/2/341
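
The description above reports results as reductions in word error rate (WER). As a reading aid, here is a minimal sketch of how WER is commonly computed: the token-level edit distance between a reference transcript and a recognizer hypothesis, divided by the reference length. The function name and the toy transcripts are hypothetical and are not taken from the article. The record does not say how the authors tokenized the dialect text or whether the reported reductions are absolute or relative.

```python
def word_error_rate(reference, hypothesis):
    """Token-level Levenshtein distance divided by the reference length.

    Both arguments are sequences of tokens (words or characters).
    This is an illustrative sketch, not the article's evaluation script.
    """
    ref = list(reference)
    hyp = list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]  # distance from i reference tokens to an empty hypothesis
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical toy transcripts: one deletion ("the") and one substitution.
reference = "open the main fan".split()
hypothesis = "open main van".split()
print(word_error_rate(reference, hypothesis))
```

Passing a list of characters instead of words yields a character error rate with the same function, which is a common choice for Chinese transcripts.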