Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano

Bibliographic Details
Main Authors: Mohammed, Antara Labiba Swapnil, Marilyn Dip Peris, Istiaque Hasan Nihal, Riasat Khan, Mohammad Abdul Matin
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Open Journal of the Communications Society
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10810367/
collection DOAJ
description The need for efficient surveillance systems that identify crimes and improve public safety is rising as violent incidents in public and industrial settings become more frequent. To ease monitoring, this research proposes a multimodal deep learning architecture that automatically recognizes and categorizes suspicious occurrences. In addition to the visual data, the approach integrates audio data from the RLVS dataset. Audio classification was performed with the VGGish and Wav2Vec 2.0 models, and various pre-trained and vision transformer-based networks were evaluated on the video data. The VGGish and MobileViT models were then combined to cover the auditory and visual modalities. With multimodal VGGish + MobileViT, the classification accuracy and F1 score improved to 97.13% and 0.97, respectively. Knowledge distillation was applied by transferring the backbone knowledge from a fine-tuned ViT model (teacher) to a MobileViT (student), training only the task head of the student model. Finally, the distilled MobileViT model was deployed on a Jetson Nano edge device for immediate identification at an average frame rate of 5–10 frames per second. The experiments show that the multimodal technique provides higher accuracy and robustness, confirming its effectiveness for real-time monitoring on the Jetson Nano and yielding a user-friendly surveillance system.
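The description mentions teacher-to-student knowledge distillation but does not state the loss that was used. As a rough illustration only, the sketch below shows the generic response-based distillation loss (temperature-softened KL divergence blended with cross-entropy). The function names, the temperature and alpha values, and the two-class logits are assumptions made for this example and are not taken from the article; it also does not model the head-only training the description refers to.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (an assumption, not the article's stated loss).

    Returns alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label),
    where the KL term uses temperature-softened distributions and the
    cross-entropy term uses the student's unsoftened prediction.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy


# Hypothetical two-class logits (index 0 = violent, 1 = non-violent).
teacher = [2.0, 0.0]
agreeing_student = [2.0, 0.0]
disagreeing_student = [0.0, 2.0]
print(distillation_loss(agreeing_student, teacher, label=0))
print(distillation_loss(disagreeing_student, teacher, label=0))
```

A student that matches the teacher contributes no KL term, so its loss reduces to the weighted cross-entropy; a student that disagrees is penalised by both terms.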
id doaj-art-57a2a6a9b621447aa94e1333f10f3e9d
institution OA Journals
issn 2644-125X
spelling doaj-art-57a2a6a9b621447aa94e1333f10f3e9d
indexed 2025-08-20T02:27:54Z
volume 6
pages 2907-2925
doi 10.1109/OJCOMS.2024.3520703
ieee_document 10810367
orcid Mohammed https://orcid.org/0009-0005-4330-8114
Antara Labiba Swapnil https://orcid.org/0009-0009-8102-2036
Marilyn Dip Peris https://orcid.org/0009-0004-0856-5249
Istiaque Hasan Nihal https://orcid.org/0009-0001-1114-4726
Riasat Khan https://orcid.org/0000-0002-5429-2235
Mohammad Abdul Matin https://orcid.org/0000-0001-9312-4122
affiliation (all authors) Electrical and Computer Engineering, North South University, Dhaka, Bangladesh
topic Deep learning
knowledge distillation
multimodal learning
surveillance systems
violence detection
vision transformer