Sign languages are the primary means of communication for deaf communities, but the development of effective automatic recognition systems remains a significant challenge. In this work, we focus on the task of Isolated Sign Language Recognition (ISLR) using a multimodal approach grounded in a Large Language Model (LLM) architecture. We fuse multiple modalities, including visual features, into the linguistic representation space of the LLM, and perform ablation studies to evaluate the individual contribution of each visual modality to recognition performance. Experiments are conducted on the AVASAG100 dataset, where our method achieves a weighted F1-score (W-F1) of 70.36±3.00 and a macro F1-score (MF1) of 62.34±3.18 when projecting pose landmarks into the LLM's embedding space. These results underscore the value of multimodal integration in ISLR and provide guidelines for future research directions.
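As a rough illustration of the landmark-projection idea mentioned above, the sketch below maps per-frame pose landmarks into an LLM's token embedding space with a small MLP. This is a minimal sketch under assumptions: the `LandmarkProjector` class, the MediaPipe-style landmark count (543), and the LLM hidden size (4096) are all hypothetical placeholders, since the abstract does not specify the paper's actual projector architecture or dimensions.

```python
import torch
import torch.nn as nn

class LandmarkProjector(nn.Module):
    """Hypothetical projector: per-frame pose landmarks -> LLM embedding space.

    Assumes a MediaPipe-style layout (543 landmarks, 3D coordinates) and an
    LLM hidden size of 4096; the paper's actual design is not given here.
    """

    def __init__(self, num_landmarks: int = 543, coord_dim: int = 3,
                 llm_hidden_dim: int = 4096):
        super().__init__()
        # Flattened (x, y, z) coordinates of all landmarks in one frame.
        in_dim = num_landmarks * coord_dim
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (batch, frames, num_landmarks, coord_dim)
        b, t, n, c = landmarks.shape
        x = landmarks.reshape(b, t, n * c)
        # Returns (batch, frames, llm_hidden_dim): one "soft token" per frame,
        # which can be concatenated with the text token embeddings fed to the LLM.
        return self.proj(x)

# Usage: a batch of 2 clips, 16 frames each, 543 landmarks with 3D coordinates.
frames = torch.randn(2, 16, 543, 3)
tokens = LandmarkProjector()(frames)
print(tokens.shape)  # torch.Size([2, 16, 4096])
```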


