An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation
Main Authors: | Yi Qin, Feifan Yu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Sensors |
Online Access: | https://www.mdpi.com/1424-8220/25/2/341 |
field | value |
---|---|
author | Yi Qin, Feifan Yu |
collection | DOAJ |
description | The coal mining industry in Northern Shaanxi is thriving, and the local dialect known as “Shapu”, characterized by a distinct Northern Shaanxi accent, is widely used within it. This study addresses the practical need for speech recognition in this dialect. We propose an end-to-end speech recognition model for the North Shaanxi dialect based on the Conformer architecture. To tailor the model to the coal mining context, we developed a specialized corpus that reflects the phonetic characteristics of the dialect and its usage in the industry. We investigated feature extraction techniques suited to the North Shaanxi dialect, focusing on its distinctive pronunciation of initial consonants and vowels. A preprocessing module was designed to accommodate the dialect’s rapid speech tempo and polyphonic nature, improving recognition performance. To strengthen the decoder’s text generation capability, we replaced the Conformer decoder with a Transformer decoder. Additionally, to reduce the computational demands of the model, we incorporated Connectionist Temporal Classification (CTC) joint training for optimization. Experimental results on our self-built speech dataset for the Northern Shaanxi coal mining industry show that the proposed Conformer–Transformer–CTC model reduces the word error rate by 9.2% and 10.3% compared with the standalone Conformer and Transformer models, respectively, confirming the effectiveness of the method. Future work will investigate improving dialect speech recognition by integrating external language models and extracting pronunciation features of different dialects. |
format | Article |
id | doaj-art-22feb00802b94886a2e9eaafe5fa251a |
institution | Kabale University |
issn | 1424-8220 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Sensors |
record updated | 2025-01-24T13:48:34Z |
volume | 25 |
issue | 2 |
article number | 341 |
doi | 10.3390/s25020341 |
affiliations | College of Computer Science & Technology, Xi’an University of Science and Technology, Xi’an 710054, China; SHCCIG Yubei Coal Industry Co., Ltd., Xi’an 710900, China |
title | An End-To-End Speech Recognition Model for the North Shaanxi Dialect: Design and Evaluation |
topic | dialect speech recognition; coal mining industry; end to end; Conformer model; Transformer model; Connectionist Temporal Classification (CTC) |
url | https://www.mdpi.com/1424-8220/25/2/341 |
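The abstract does not state the loss weighting used for CTC joint training, nor whether the reported error rates are counted over words or characters. As a hedged, illustrative sketch (not the authors' implementation), the pure-Python code below shows the common hybrid CTC/attention objective, L = λ·L_CTC + (1 − λ)·L_att, that "CTC joint training" usually refers to, together with a Levenshtein-based error rate of the kind behind the reported 9.2% and 10.3% reductions. The function names `ctc_loss`, `attention_ce_loss`, `joint_loss`, and `word_error_rate`, and the weight λ = 0.3, are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_loss(log_probs: Sequence[Sequence[float]], target: Sequence[int], blank: int = 0) -> float:
    """CTC negative log-likelihood of `target`, computed with the forward algorithm.

    log_probs[t][k] is the log-probability of token k at encoder frame t.
    """
    ext = [blank]
    for tok in target:
        ext += [tok, blank]  # blank-interleaved labels: _ y1 _ y2 ... _
    S = len(ext)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            acc = alpha[s]  # stay on the same symbol
            if s >= 1:
                acc = log_add(acc, alpha[s - 1])  # advance one position
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = log_add(acc, alpha[s - 2])
            new[s] = acc + frame[ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    total = alpha[-1] if S == 1 else log_add(alpha[-1], alpha[-2])
    return -total


def attention_ce_loss(dec_log_probs: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Teacher-forced decoder cross-entropy: -sum_u log p(y_u | y_<u, x)."""
    return -sum(step[tok] for step, tok in zip(dec_log_probs, target))


def joint_loss(l_ctc: float, l_att: float, ctc_weight: float = 0.3) -> float:
    """Hybrid objective: lambda * L_CTC + (1 - lambda) * L_att (lambda = 0.3 is assumed)."""
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """(substitutions + deletions + insertions) / len(ref); `ref` must be non-empty."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two frames, vocabulary {blank=0, 1}, uniform probabilities.
    uniform = [[math.log(0.5), math.log(0.5)]] * 2
    l_ctc = ctc_loss(uniform, [1])  # paths "1_", "_1", "11" give -log(0.75)
    print(round(l_ctc, 4))  # prints 0.2877
```

In this formulation the CTC branch pushes the shared encoder toward a monotonic frame-to-label alignment, while the Transformer decoder's cross-entropy models label context; the weight trades the two signals off. For Chinese transcripts, passing `list(text)` for both reference and hypothesis makes `word_error_rate` return a character-level error rate.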