Video memorability prediction has emerged as a key challenge for improving information retrieval, content design, and user engagement. Prior work has shown that semantic cues play a crucial role in determining memorability, with recent studies leveraging Contrastive Language-Image Pre-training (CLIP) encoders to incorporate semantic information. However, the specific improvements attributable to CLIP models remain unclear, as few studies systematically compare their performance against equivalent unimodal encoders or explore fine-tuning strategies. This work addresses that gap through a comprehensive, controlled evaluation of CLIP-based and unimodal encoders for video memorability prediction. We propose FCLIP, a domain-adapted extension of CLIP that undergoes additional contrastive pre-training on memorability-specific image-text pairs. Our experiments assess both feature extraction and supervised fine-tuning, ensuring fair comparisons across models with matched architecture and parameter count. Results show that FCLIP image encoders achieve a Spearman Rank Correlation Coefficient (SRCC) of 0.672 on the Memento10k dataset, significantly outperforming unimodal Vision Transformers. FCLIP text encoders similarly outperform unimodal baselines, reaching an SRCC of 0.632. These findings demonstrate that contrastive learning and domain adaptation substantially improve memorability prediction, highlighting the importance of semantic and multimodal pre-training in developing advanced content analysis systems.
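The sketch below illustrates the frozen-feature evaluation setting referenced above: CLIP image embeddings are extracted without gradient updates, a simple regression head maps them to memorability scores, and performance is measured with the Spearman Rank Correlation Coefficient (SRCC). This is a minimal illustration, not the paper's pipeline; the CLIP checkpoint, the Ridge regressor, the single-frame treatment of videos, and the random placeholder data (standing in for a dataset such as Memento10k) are all assumptions made solely to keep the example self-contained and runnable.

```python
# Minimal sketch of frozen-feature memorability evaluation with SRCC.
# NOT the paper's exact method: checkpoint, regressor, and data are placeholders.
import numpy as np
import torch
from PIL import Image
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder data: random frames and scores stand in for real video frames
# paired with ground-truth memorability scores in [0, 1].
rng = np.random.default_rng(0)
frames = [
    Image.fromarray(rng.integers(0, 256, (224, 224, 3), dtype=np.uint8))
    for _ in range(64)
]
scores = rng.random(64)

# Extract frozen CLIP image embeddings (the "feature extraction" setting).
with torch.no_grad():
    inputs = processor(images=frames, return_tensors="pt").to(device)
    feats = model.get_image_features(**inputs).cpu().numpy()

# Fit a simple regression head on frozen features and report SRCC on a held-out split.
split = 48
reg = Ridge(alpha=1.0).fit(feats[:split], scores[:split])
preds = reg.predict(feats[split:])
srcc, _ = spearmanr(preds, scores[split:])
print(f"SRCC on held-out frames: {srcc:.3f}")
```

With real memorability annotations in place of the random scores, the same SRCC computation is what the reported figures (e.g., 0.672 for the image encoder) correspond to; swapping the checkpoint for a domain-adapted one is how a comparison like CLIP vs. FCLIP would be run under identical conditions.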


