A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11098792/ |
| Summary: | This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images with a fully connected dense network for six handpicked statistical features computed from the time-domain signal. The outputs of these branches are flattened and merged, enabling the model to learn complementary patterns from both visual and numerical modalities. A new dataset, voice-18, was developed, consisting of one-second audio clips from 18 speakers across 18 classes. Extensive experiments evaluated the performance of individual and combined inputs. The model performed best when all three inputs were used together, attaining accuracies of $0.9849 \pm 0.0093$ on voice-18, $0.8825 \pm 0.0137$ on urban sound (US)8K, and $0.9220 \pm 0.0276$ on environmental sound classification (ESC)-50, while models trained on fewer than three inputs underperformed. These findings confirm the effectiveness of the proposed multimodal architecture for accurate voice and sound classification across different datasets. |
| ISSN: | 2169-3536 |
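The fusion step described in the summary can be sketched in a few lines. This is not the authors' code: it is a minimal NumPy illustration, under assumed shapes, of how two flattened CNN branch outputs (spectrogram and Mel spectrogram) and the six statistical features are concatenated into one feature vector per clip before classification; the branch output size (8×8×16) and batch size are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the merge described in the abstract; all shapes are
# illustrative assumptions, not values taken from the paper.
rng = np.random.default_rng(0)

def branch_flatten(x):
    """Stand-in for a CNN branch: flatten its feature map to a vector."""
    return x.reshape(x.shape[0], -1)

batch = 4
spec_maps = rng.standard_normal((batch, 8, 8, 16))  # spectrogram branch output
mel_maps = rng.standard_normal((batch, 8, 8, 16))   # Mel spectrogram branch output
stats = rng.standard_normal((batch, 6))             # six statistical features

# Merge: flatten each image branch and concatenate along the feature axis,
# so the classifier head sees 1024 + 1024 + 6 = 2054 fused features per clip.
merged = np.concatenate(
    [branch_flatten(spec_maps), branch_flatten(mel_maps), stats], axis=1
)
print(merged.shape)  # (4, 2054)
```

A classifier head (dense layers ending in an 18-way softmax for voice-18) would then be trained on `merged`; concatenation after flattening is what lets the visual and numerical modalities contribute jointly to the decision.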