Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method

Bibliographic Details
Main Authors: Guofeng Ren, Guicheng Shao, Jianmei Fu
Format: Article
Language: English
Published: Wiley 2020-01-01
Series: Complexity
Online Access: http://dx.doi.org/10.1155/2020/4356981
_version_ 1832566362214498304
author Guofeng Ren
Guicheng Shao
Jianmei Fu
author_facet Guofeng Ren
Guicheng Shao
Jianmei Fu
author_sort Guofeng Ren
collection DOAJ
description In recent years, with the rapid development of artificial intelligence (AI) and human-machine interaction technology, speech recognition and production systems have had to keep pace, which requires improving recognition accuracy by adding novel features, fusing features, and improving recognition methods. Aiming to develop a novel recognition feature and apply it to speech recognition, this paper presents a new method for articulatory-to-acoustic conversion. In this study, articulatory features (i.e., velocities of the tongue and motion of the lips) are converted into acoustic features (i.e., the second formant and Mel-cepstra). Considering the graphical representation of the articulators’ motion, the study combines Bidirectional Long Short-Term Memory (BiLSTM) with a convolutional neural network (CNN) and adopts the idea of word attention in Mandarin to extract semantic features. The work uses the electromagnetic articulography (EMA) database designed by Taiyuan University of Technology, which contains 299 Mandarin disyllables and sentences from ten speakers; 8-dimensional articulatory features and a 1-dimensional semantic feature are extracted through the word-attention layer, and 200 samples are used for training and 99 for testing the articulatory-to-acoustic conversion. Finally, Root Mean Square Error (RMSE), Mean Mel-Cepstral Distortion (MMCD), and the correlation coefficient are used to evaluate the conversion and to compare it with a Gaussian Mixture Model (GMM) and a BiLSTM recurrent neural network (BiLSTM-RNN). The results show that the MMCD of the Mel-Frequency Cepstral Coefficients (MFCC) was 1.467 dB and the RMSE of F2 was 22.10 Hz. These results can be used in feature fusion and speech recognition to improve recognition accuracy.
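The abstract above describes a BiLSTM-CNN converter with a word-attention layer that maps 8-dimensional articulatory features plus a 1-dimensional semantic feature to F2 and Mel-cepstra, evaluated with MMCD and RMSE. The following is a minimal sketch of how such a model and the two reported metrics might be implemented; it is not the authors' code, and the hidden sizes, kernel width, the 13-dimensional Mel-cepstral target, and all class and function names (BiLSTMCNNWordAttention, mmcd_db, rmse) are illustrative assumptions.

# Minimal sketch of a BiLSTM-CNN word-attention articulatory-to-acoustic converter
# and the MMCD / RMSE metrics mentioned in the abstract. All layer sizes are assumed.
import math
import torch
import torch.nn as nn

class BiLSTMCNNWordAttention(nn.Module):
    def __init__(self, in_dim=9, conv_channels=64, lstm_hidden=128, out_dim=14):
        super().__init__()
        # 1-D convolution over time captures the "graphical" shape of articulator motion
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Bidirectional LSTM models forward and backward temporal context
        self.bilstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True,
                              bidirectional=True)
        # Word-attention: score each frame, pool a context vector, concatenate it back
        self.attn = nn.Linear(2 * lstm_hidden, 1)
        self.out = nn.Linear(4 * lstm_hidden, out_dim)  # [frame ; context] -> F2 + Mel-cepstra

    def forward(self, x):                                   # x: (batch, time, 9)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (batch, time, conv_channels)
        h, _ = self.bilstm(h)                               # (batch, time, 2*lstm_hidden)
        w = torch.softmax(self.attn(h), dim=1)              # attention weights over time
        context = (w * h).sum(dim=1, keepdim=True).expand_as(h)
        return self.out(torch.cat([h, context], dim=-1))    # (batch, time, out_dim)

def mmcd_db(c_ref, c_pred):
    # Mean Mel-Cepstral Distortion in dB over all frames (standard MCD formula)
    diff2 = ((c_ref - c_pred) ** 2).sum(dim=-1)
    return ((10.0 / math.log(10)) * torch.sqrt(2.0 * diff2)).mean().item()

def rmse(y_ref, y_pred):
    # Root Mean Square Error, e.g. for the F2 trajectory in Hz
    return torch.sqrt(((y_ref - y_pred) ** 2).mean()).item()

# Toy usage: one utterance of 120 frames with 8 articulatory + 1 semantic feature
model = BiLSTMCNNWordAttention()
x = torch.randn(1, 120, 9)
y = model(x)                                   # (1, 120, 14): F2 (dim 0) + 13 Mel-cepstra
print(mmcd_db(torch.randn(1, 120, 13), y[..., 1:]), rmse(torch.randn(1, 120), y[..., 0]))

In the paper, per-frame predictions of this kind are compared against GMM and BiLSTM-RNN baselines using exactly these kinds of frame-level MMCD, RMSE, and correlation measurements.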
format Article
id doaj-art-b53d7c5ab166490a89c843c48308bef7
institution Kabale University
issn 1076-2787
1099-0526
language English
publishDate 2020-01-01
publisher Wiley
record_format Article
series Complexity
spelling doaj-art-b53d7c5ab166490a89c843c48308bef7 2025-02-03T01:04:27Z eng Wiley Complexity 1076-2787 1099-0526 2020-01-01 2020 10.1155/2020/4356981 4356981 Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method Guofeng Ren; Guicheng Shao; Jianmei Fu (all: Department of Electronics, Xinzhou Teachers University, Xinzhou 034000, China) http://dx.doi.org/10.1155/2020/4356981
spellingShingle Guofeng Ren
Guicheng Shao
Jianmei Fu
Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
Complexity
title Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
title_full Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
title_fullStr Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
title_full_unstemmed Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
title_short Articulatory-to-Acoustic Conversion Using BiLSTM-CNN Word-Attention-Based Method
title_sort articulatory to acoustic conversion using bilstm cnn word attention based method
url http://dx.doi.org/10.1155/2020/4356981
work_keys_str_mv AT guofengren articulatorytoacousticconversionusingbilstmcnnwordattentionbasedmethod
AT guichengshao articulatorytoacousticconversionusingbilstmcnnwordattentionbasedmethod
AT jianmeifu articulatorytoacousticconversionusingbilstmcnnwordattentionbasedmethod