A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11098792/ |
| Summary: | This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images with a fully connected dense network for six handpicked statistical features computed from the time-domain signal. The outputs of these branches are flattened and merged, enabling the model to learn complementary patterns from both visual and numerical modalities. A new dataset, voice-18, was developed, consisting of one-second audio clips from 18 speakers across 18 classes. Extensive experiments evaluated the performance of individual and combined inputs. The model performed best when all three inputs were used together, attaining accuracies of $0.9849 \pm 0.0093$ on voice-18, $0.8825 \pm 0.0137$ on urban sound (US)8K, and $0.9220 \pm 0.0276$ on environmental sound classification (ESC)-50, while models trained on fewer than three inputs underperformed. These findings confirm the effectiveness of the proposed multimodal architecture for accurate voice and sound classification across different datasets. |
| ISSN: | 2169-3536 |
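The fusion step described in the summary can be sketched in a few lines. This is not the authors' code: it is a minimal NumPy illustration, under assumed shapes, of how two flattened CNN branch outputs (spectrogram and Mel spectrogram) and the six statistical features are concatenated into one feature vector per clip before classification; the branch output size (8×8×16) and batch size are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the merge described in the abstract; all shapes are
# illustrative assumptions, not values taken from the paper.
rng = np.random.default_rng(0)

def branch_flatten(x):
    """Stand-in for a CNN branch: flatten its feature map to a vector."""
    return x.reshape(x.shape[0], -1)

batch = 4
spec_maps = rng.standard_normal((batch, 8, 8, 16))  # spectrogram branch output
mel_maps = rng.standard_normal((batch, 8, 8, 16))   # Mel spectrogram branch output
stats = rng.standard_normal((batch, 6))             # six statistical features

# Merge: flatten each image branch and concatenate along the feature axis,
# so the classifier head sees 1024 + 1024 + 6 = 2054 fused features per clip.
merged = np.concatenate(
    [branch_flatten(spec_maps), branch_flatten(mel_maps), stats], axis=1
)
print(merged.shape)  # (4, 2054)
```

A classifier head (dense layers ending in an 18-way softmax for voice-18) would then be trained on `merged`; concatenation after flattening is what lets the visual and numerical modalities contribute jointly to the decision.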