Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano
The need for efficient surveillance systems to identify crimes and improve public safety is rising as violent incidents in public and industrial settings occur more frequently. To facilitate the monitoring process, this research proposes a multimodal deep learning architecture that can automatically...
| Main Authors: | Mohammed, Antara Labiba Swapnil, Marilyn Dip Peris, Istiaque Hasan Nihal, Riasat Khan, Mohammad Abdul Matin |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Open Journal of the Communications Society |
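| Subjects: | Deep learning, knowledge distillation, multimodal learning, surveillance systems, violence detection, vision transformer |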
| Online Access: | https://ieeexplore.ieee.org/document/10810367/ |
| _version_ | 1850146202364936192 |
|---|---|
| author | Mohammed; Antara Labiba Swapnil; Marilyn Dip Peris; Istiaque Hasan Nihal; Riasat Khan; Mohammad Abdul Matin |
| author_facet | Mohammed; Antara Labiba Swapnil; Marilyn Dip Peris; Istiaque Hasan Nihal; Riasat Khan; Mohammad Abdul Matin |
| author_sort | Mohammed |
| collection | DOAJ |
| description | The need for efficient surveillance systems to identify crimes and improve public safety is rising as violent incidents in public and industrial settings occur more frequently. To facilitate the monitoring process, this research proposes a multimodal deep learning architecture that can automatically recognize and categorize suspicious occurrences. In addition to the visual data, audio data from the RLVS dataset is integrated to form a multimodal approach. Audio classification was performed with the VGGish and Wav2Vec 2.0 models, and various pre-trained and vision transformer-based networks were evaluated on the video data. The VGGish and MobileViT models were then combined to cover the auditory and visual modalities. With multimodal VGGish + MobileViT, the classification accuracy and F1 score improved to 97.13% and 0.97, respectively. Knowledge distillation was applied by transferring the backbone knowledge from a fine-tuned ViT model (teacher) to a MobileViT (student), training only the task head of the student model. Finally, the distilled MobileViT model was deployed on a Jetson Nano edge device for immediate identification at an average frame rate of 5–10 frames per second. The experiments demonstrate that the multimodal technique provides higher accuracy and robustness, confirming its effectiveness for real-time monitoring on the Jetson Nano and yielding a user-friendly surveillance system. |
| format | Article |
| id | doaj-art-57a2a6a9b621447aa94e1333f10f3e9d |
| institution | OA Journals |
| issn | 2644-125X |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Open Journal of the Communications Society |
| spelling | doaj-art-57a2a6a9b621447aa94e1333f10f3e9d; 2025-08-20T02:27:54Z; eng; IEEE; IEEE Open Journal of the Communications Society; 2644-125X; 2025-01-01; 6; 2907; 2925; 10.1109/OJCOMS.2024.3520703; 10810367; Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano; Mohammed (https://orcid.org/0009-0005-4330-8114), Antara Labiba Swapnil (https://orcid.org/0009-0009-8102-2036), Marilyn Dip Peris (https://orcid.org/0009-0004-0856-5249), Istiaque Hasan Nihal (https://orcid.org/0009-0001-1114-4726), Riasat Khan (https://orcid.org/0000-0002-5429-2235), Mohammad Abdul Matin (https://orcid.org/0000-0001-9312-4122); Electrical and Computer Engineering, North South University, Dhaka, Bangladesh; https://ieeexplore.ieee.org/document/10810367/; Deep learning, knowledge distillation, multimodal learning, surveillance systems, violence detection, vision transformer |
| spellingShingle | Mohammed; Antara Labiba Swapnil; Marilyn Dip Peris; Istiaque Hasan Nihal; Riasat Khan; Mohammad Abdul Matin; Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano; IEEE Open Journal of the Communications Society; Deep learning; knowledge distillation; multimodal learning; surveillance systems; violence detection; vision transformer |
| title | Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano |
| title_full | Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano |
| title_fullStr | Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano |
| title_full_unstemmed | Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano |
| title_short | Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano |
| title_sort | multimodal deep learning for violence detection vggish and mobilevit integration with knowledge distillation on jetson nano |
| topic | Deep learning; knowledge distillation; multimodal learning; surveillance systems; violence detection; vision transformer |
| url | https://ieeexplore.ieee.org/document/10810367/ |
| work_keys_str_mv | AT mohammed multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano AT antaralabibaswapnil multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano AT marilyndipperis multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano AT istiaquehasannihal multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano AT riasatkhan multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano AT mohammadabdulmatin multimodaldeeplearningforviolencedetectionvggishandmobilevitintegrationwithknowledgedistillationonjetsonnano |
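
The description mentions knowledge distillation from a fine-tuned ViT teacher to a MobileViT student, but the record does not give the loss that was used. As a rough illustration only, the sketch below shows the generic soft-target distillation loss (temperature-scaled teacher and student distributions compared with KL divergence, blended with ordinary cross-entropy). This is a swapped-in, textbook formulation and not necessarily the authors' backbone-transfer procedure. The function names, the `temperature` and `alpha` values, and the two-class logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with hard-label cross-entropy.

    loss = alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)
    The temperature and alpha defaults are illustrative assumptions.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Hypothetical logits for one clip: index 0 = non-violent, index 1 = violent.
teacher = [-1.2, 2.3]
student = [0.1, 0.4]
loss = distillation_loss(student, teacher, label=1)
matched = distillation_loss(teacher, teacher, label=1)
```

When the student reproduces the teacher's logits, the KL term is zero and only the cross-entropy term remains, so `matched` is smaller than `loss`. In a setup where only the student's task head is trained, as the description states, the same loss would be applied while the backbone weights stay frozen.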