Spectrogram Features-Based Automatic Speaker Identification For Smart Services
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2025-12-01 |
| Series: | Applied Artificial Intelligence |
| Online Access: | https://www.tandfonline.com/doi/10.1080/08839514.2025.2459476 |
| Summary: | Automatic speaker identification (ASI) is an exciting area of research with numerous applications, such as surveillance, voice authentication, identity verification, and electronic voice eavesdropping. This study investigates ASI based on features derived from spectrogram images through a convolutional neural network (CNN) with rectangular-shaped kernels. Traditionally, CNNs employ square-shaped kernels and max-pooling operations at different layers, a design optimized to handle 2D data. Spectrograms, however, encode information slightly differently: frequency is displayed along the y-axis, time along the x-axis, and amplitude is denoted by the intensity of the image at a given point. The main contributions of this study are: (1) To analyze audio signals effectively using spectrograms, the study proposes using spectrogram features with rectangular kernels of different sizes and shapes to derive distinctive features, thereby improving the recognition accuracy of the speaker identification system. (2) The extracted spectrogram-based features and models are evaluated on the ELSDSR, TSP, and LibriSpeech datasets, achieving weighted accuracies of 96.0%, 99.2%, and 97.6%, respectively. (3) The proposed rectangular-shaped CNN approach effectively derives suitable features from spectrogram images and outperforms several baseline techniques when assessed on the ELSDSR, TSP, and LibriSpeech datasets. |
|---|---|
| ISSN: | 0883-9514 1087-6545 |
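
The summary's point about rectangular kernels can be illustrated with a minimal sketch. It is not the authors' network: the `conv2d_valid` helper, the toy spectrogram, and the 3x1 and 1x3 kernel shapes are all assumptions chosen for illustration, and the code is written in plain Python to stay dependency-free. It shows how a tall kernel spans several frequency bins (y-axis) in a single time frame, while a wide kernel spans several time frames (x-axis) in a single frequency bin.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding, stride 1) 2D cross-correlation, as used in CNN layers."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Sum of element-wise products over the kernel window at (i, j).
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
    return out


# Toy "spectrogram" (assumed data): rows are frequency bins (y-axis),
# columns are time frames (x-axis), values are amplitudes (intensity).
# A single steady tone occupies frequency bin 2 across all 6 frames.
spectrogram = [[0.0] * 6 for _ in range(5)]
for t in range(6):
    spectrogram[2][t] = 1.0

# Hypothetical rectangular kernels (shapes chosen for illustration only).
tall_kernel = [[1.0], [1.0], [1.0]]  # 3x1: 3 frequency bins, 1 time frame
wide_kernel = [[1.0, 1.0, 1.0]]      # 1x3: 1 frequency bin, 3 time frames

tall_out = conv2d_valid(spectrogram, tall_kernel)  # 3 rows x 6 columns
wide_out = conv2d_valid(spectrogram, wide_kernel)  # 5 rows x 4 columns
```

In this toy case the wide kernel accumulates the tone's energy over time within its own frequency bin, whereas the tall kernel spreads a single frame's energy across neighbouring frequency positions. A square kernel would mix both axes equally, which is the design choice the summary contrasts against.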