Brain-inspired multimodal motion and fine-grained action recognition

Introduction: Traditional action recognition methods predominantly rely on a single modality, such as vision or motion, which presents significant limitations when dealing with fine-grained action recognition. These methods struggle particularly with video data containing complex combinations of actions and subtle motion variations.

Bibliographic Details
Main Authors: Yuening Li, Xiuhua Yang, Changkui Chen
Format: Article
Language: English
Published: Frontiers Media S.A. 2025-01-01
Series: Frontiers in Neurorobotics
Subjects: brain-inspired; multimodal; action recognition; CLIP; clustering algorithms
Online Access: https://www.frontiersin.org/articles/10.3389/fnbot.2024.1502071/full
author Yuening Li
Xiuhua Yang
Changkui Chen
collection DOAJ
description Introduction: Traditional action recognition methods predominantly rely on a single modality, such as vision or motion, which presents significant limitations when dealing with fine-grained action recognition. These methods struggle particularly with video data containing complex combinations of actions and subtle motion variations. Methods: Such methods typically depend on handcrafted feature extractors or simple convolutional neural network (CNN) architectures, which makes effective multimodal fusion challenging. This study introduces a novel architecture called FGM-CLIP (Fine-Grained Motion CLIP) to enhance fine-grained action recognition. FGM-CLIP leverages the powerful capabilities of Contrastive Language-Image Pretraining (CLIP), integrating a fine-grained motion encoder and a multimodal fusion layer to achieve precise end-to-end action recognition. By jointly optimizing visual and motion features, the model captures subtle action variations, resulting in higher classification accuracy on complex video data. Results and discussion: Experimental results demonstrate that FGM-CLIP significantly outperforms existing methods on multiple fine-grained action recognition datasets. Its multimodal fusion strategy notably improves the model's robustness and accuracy, particularly for videos with intricate action patterns.
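The description above outlines the architecture only at a high level (a CLIP-based visual pathway, a fine-grained motion encoder, and a multimodal fusion layer trained jointly). As a rough, non-authoritative illustration of how such a pipeline could be wired together, the minimal sketch below assumes per-frame CLIP-style features (stubbed here with random tensors), a frame-difference motion encoder, and a concatenation-based fusion classifier; the module names, dimensions, and fusion strategy are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class FineGrainedMotionEncoder(nn.Module):
    """Hypothetical name: encodes frame-to-frame feature differences to capture subtle motion cues."""

    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, time, dim) per-frame visual features
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]   # temporal differences between consecutive frames
        return self.net(diffs).mean(dim=1)                 # pooled motion feature: (batch, dim)


class MultimodalFusion(nn.Module):
    """Hypothetical name: concatenates visual and motion features and classifies the action."""

    def __init__(self, dim=512, num_classes=100):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, visual_feat, motion_feat):
        return self.classifier(torch.cat([visual_feat, motion_feat], dim=-1))


# Toy forward pass: random tensors stand in for per-frame CLIP features.
batch, time, dim = 2, 16, 512
frame_feats = torch.randn(batch, time, dim)
visual_feat = frame_feats.mean(dim=1)                       # pooled visual representation
motion_feat = FineGrainedMotionEncoder(dim)(frame_feats)
logits = MultimodalFusion(dim, num_classes=100)(visual_feat, motion_feat)
print(logits.shape)  # torch.Size([2, 100])
```

In a real setting the random frame features would be replaced by the output of a pretrained CLIP image encoder applied to sampled video frames, and the whole pipeline would be trained end to end on the target fine-grained action labels.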
format Article
id doaj-art-36661640b19742d5868c68f3b2c4c0d7
institution Kabale University
issn 1662-5218
language English
publishDate 2025-01-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Neurorobotics
spelling Brain-inspired multimodal motion and fine-grained action recognition. Yuening Li (Wuhan Sports University, Wuhan, China); Xiuhua Yang (Wuhan Sports University, Wuhan, China); Changkui Chen (School of Physical Education and Training, Party School of Shandong Provincial Committee of the Communist Party of China (Shandong Administrative Institute), Jinan, Shandong, China). Frontiers in Neurorobotics, vol. 18, 2025-01-01, article 1502071. doi: 10.3389/fnbot.2024.1502071
title Brain-inspired multimodal motion and fine-grained action recognition
topic brain-inspired
multimodal
action recognition
CLIP
clustering algorithms
url https://www.frontiersin.org/articles/10.3389/fnbot.2024.1502071/full