Hierarchical multi‐modal video summarization with dynamic sampling

Bibliographic Details
Main Authors: Lingjian Yu, Xing Zhao, Liang Xie, Haoran Liang, Ronghua Liang
Format: Article
Language: English
Published: Wiley 2024-12-01
Series: IET Image Processing
Online Access: https://doi.org/10.1049/ipr2.13269
Description
Summary: Previous video summarization methods often neglected inter‐frame variations during the preprocessing stage. Sampling repeated frames can lead to information redundancy, while missing key frames can distort semantic comprehension and make the generated summaries inaccurate. This work proposes a dynamic sampling module that leverages frame‐level motion information to alleviate these issues. The module samples at a higher frequency during intervals with significant changes, which captures details at a finer granularity. Combined with a hierarchical multi‐modal structure, it integrates shot‐level visual and textual information to enhance the semantic understanding of video clips and improve the accuracy of the summarized content. Extensive experiments on the benchmark datasets SumMe and TVSum demonstrate the effectiveness of the proposed method.
ISSN: 1751-9659, 1751-9667
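
The summary describes sampling more densely where frame-level motion is high. The record does not say how the authors measure motion or choose frames, so the following is only a minimal sketch of the general idea, not the paper's module. It assumes motion can be approximated by the mean absolute pixel difference between consecutive frames, and that a frame is kept whenever accumulated motion passes a budget. The names `motion_scores`, `dynamic_sample`, `motion_budget`, and `max_gap` are hypothetical.

```python
from typing import List, Sequence

# Assumption: each frame is a flattened grayscale image given as a sequence of floats.
Frame = Sequence[float]


def motion_scores(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    The first frame has no predecessor, so its score is 0.0.
    """
    if not frames:
        return []
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        scores.append(diff)
    return scores


def dynamic_sample(scores: Sequence[float], motion_budget: float, max_gap: int) -> List[int]:
    """Select frame indices, sampling densely where motion is high.

    A frame is kept when the motion accumulated since the last kept frame
    reaches `motion_budget`, or when `max_gap` frames have passed without
    a sample (so static intervals are still covered sparsely).
    """
    if motion_budget <= 0 or max_gap < 1:
        raise ValueError("motion_budget must be > 0 and max_gap must be >= 1")
    if not scores:
        return []
    picked = [0]          # always keep the first frame
    accumulated = 0.0
    for i in range(1, len(scores)):
        accumulated += scores[i]
        if accumulated >= motion_budget or i - picked[-1] >= max_gap:
            picked.append(i)
            accumulated = 0.0
    return picked


# Toy clip: eight static one-pixel frames followed by four rapidly changing ones.
frames = [[0.0]] * 8 + [[10.0], [20.0], [30.0], [40.0]]
scores = motion_scores(frames)
print(dynamic_sample(scores, motion_budget=5.0, max_gap=4))
# prints [0, 4, 8, 9, 10, 11]
```

In the toy clip, the static interval is sampled only every fourth frame, while every frame in the changing interval is kept. A real pipeline would likely replace the pixel-difference proxy with a stronger motion estimate such as optical flow.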