Multimodal Translation with Lip Movements, Audio, and Subtitles
- DOI
- 10.2991/978-2-38476-523-2_21
- Keywords
- Multimodal Translation; Visual Speech Recognition; Cross-modal Alignment
- Abstract
Multimodal translation has emerged as a promising approach to enhance the quality of language translation by incorporating non-textual information such as visual and auditory cues. However, existing methods often rely on limited modality pairs (e.g., speech-text or image-text) and overlook the rich, complementary information embedded in real-world video content. This paper presents a novel tri-modal translation approach that jointly leverages lip movements (visual modality), raw speech audio (auditory modality), and subtitle text (linguistic modality) to improve video-to-text translation performance. To address temporal misalignment across modalities, a cross-modal alignment transformer is introduced to synchronize token-level features. A dynamic modality-gating mechanism is proposed to adaptively weight each modality based on contextual reliability, enabling robust translation even in the presence of noise or occlusion. Furthermore, a prompt-conditioned decoder is designed to guide the generation process using structured natural language instructions derived from all modalities. To enhance semantic alignment prior to fine-tuning, contrastive multimodal pretraining is applied using triplet loss over positive and negative modality pairs. Experiments on publicly available video translation benchmarks demonstrate that the proposed system significantly outperforms strong baselines in terms of BLEU, METEOR, and human evaluation. The results suggest that integrating visual, auditory, and linguistic signals in a unified, prompt-aware architecture offers a powerful strategy for real-world multimodal translation.
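The chapter does not include source code; the sketch below illustrates one way a cross-modal alignment layer of the kind described in the abstract could be realized. It assumes PyTorch, per-modality features that have already been encoded, and subtitle tokens as the alignment anchor; the class name and layer sizes are illustrative, not the author's implementation.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Illustrative cross-modal alignment block (not the paper's code).

    Subtitle-token features attend over another modality's frame
    features, re-timing lip or audio representations onto the token
    grid so that all three streams share one time axis.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_feats: torch.Tensor,
                frame_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, num_tokens, dim)  subtitle tokens
        # frame_feats: (batch, num_frames, dim)  lip or audio frames
        aligned, _ = self.attn(token_feats, frame_feats, frame_feats)
        return self.norm(token_feats + aligned)  # token-level output
```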
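Similarly, the dynamic modality-gating mechanism can be pictured as a small scoring network that assigns each modality a reliability weight at every position, with the fused representation as the softmax-weighted sum. This is a minimal sketch under the same PyTorch assumptions; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Minimal sketch of a dynamic modality-gating layer.

    Given token-aligned features from the lip, audio, and subtitle
    streams, a scorer estimates one reliability weight per modality
    and position; the output is the softmax-weighted combination.
    """

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # Gate scores are conditioned on the concatenation of all
        # modality features at each aligned position.
        self.scorer = nn.Sequential(
            nn.Linear(dim * num_modalities, dim),
            nn.Tanh(),
            nn.Linear(dim, num_modalities),
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (batch, seq_len, dim) tensors, one per
        # modality, assumed already synchronized by the aligner.
        stacked = torch.stack(feats, dim=2)        # (B, T, M, D)
        concat = torch.cat(feats, dim=-1)          # (B, T, M*D)
        weights = self.scorer(concat).softmax(-1)  # (B, T, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)  # (B, T, D)
```

In a sketch like this, the softmax lets the gate down-weight, say, the lip stream when the face is occluded or the audio stream under noise, which corresponds to the robustness behaviour the abstract claims.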
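For the contrastive pretraining stage, a triplet objective over matched and mismatched modality pairs might look as follows. The use of cosine distance, mean-pooled clip embeddings, and the margin value are assumptions for illustration; the abstract specifies only that triplet loss is applied over positive and negative modality pairs.

```python
import torch
import torch.nn.functional as F

def modality_triplet_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Triplet loss over pooled modality embeddings (illustrative).

    anchor and positive come from temporally matched clips in two
    different modalities (e.g., lip video and its own audio);
    negative is the positive's modality drawn from a different clip.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    d_pos = 1.0 - (a * p).sum(-1)  # cosine distance to the match
    d_neg = 1.0 - (a * n).sum(-1)  # cosine distance to the mismatch
    return F.relu(d_pos - d_neg + margin).mean()
```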
- Copyright
- © 2025 The Author(s)
- Open Access
- This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
- Cite this article
TY  - CONF
AU  - Jianing Li
PY  - 2025
DA  - 2025/12/29
TI  - Multimodal Translation with Lip Movements, Audio, and Subtitles
BT  - Proceedings of the 5th International Conference on New Media Development and Modernised Education (NMDME 2025)
PB  - Atlantis Press
SP  - 197
EP  - 205
SN  - 2352-5398
UR  - https://doi.org/10.2991/978-2-38476-523-2_21
DO  - 10.2991/978-2-38476-523-2_21
ID  - Li2025
ER  -