Multimodal Translation with Lip Movements, Audio, and Subtitles
- DOI
- 10.2991/978-2-38476-523-2_21
- Keywords
- Multimodal Translation; Visual Speech Recognition; Cross-modal Alignment
- Abstract
Multimodal translation has emerged as a promising approach to enhance the quality of language translation by incorporating non-textual information such as visual and auditory cues. However, existing methods often rely on limited modality pairs (e.g., speech-text or image-text) and overlook the rich, complementary information embedded in real-world video content. This paper presents a novel tri-modal translation approach that jointly leverages lip movements (visual modality), raw speech audio (auditory modality), and subtitle text (linguistic modality) to improve video-to-text translation performance. To address temporal misalignment across modalities, a cross-modal alignment transformer is introduced to synchronize token-level features. A dynamic modality-gating mechanism is proposed to adaptively weight each modality based on contextual reliability, enabling robust translation even in the presence of noise or occlusion. Furthermore, a prompt-conditioned decoder is designed to guide the generation process using structured natural language instructions derived from all modalities. To enhance semantic alignment prior to fine-tuning, contrastive multimodal pretraining is applied using triplet loss over positive and negative modality pairs. Experiments on publicly available video translation benchmarks demonstrate that the proposed system significantly outperforms strong baselines in terms of BLEU, METEOR, and human evaluation. The results suggest that integrating visual, auditory, and linguistic signals in a unified, prompt-aware architecture offers a powerful strategy for real-world multimodal translation.
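The chapter does not include source code; the sketch below illustrates one way a cross-modal alignment layer of the kind described in the abstract could be realized. It assumes PyTorch, per-modality features that have already been encoded, and subtitle tokens as the alignment anchor; the class name and layer sizes are illustrative, not the author's implementation.

```python
import torch
import torch.nn as nn

class CrossModalAligner(nn.Module):
    """Illustrative cross-modal alignment block (not the paper's code).

    Subtitle-token features attend over another modality's frame
    features, re-timing lip or audio representations onto the token
    grid so that all three streams share one time axis.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_feats: torch.Tensor,
                frame_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, num_tokens, dim)  subtitle tokens
        # frame_feats: (batch, num_frames, dim)  lip or audio frames
        aligned, _ = self.attn(token_feats, frame_feats, frame_feats)
        return self.norm(token_feats + aligned)  # token-level output
```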
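Similarly, the dynamic modality-gating mechanism can be pictured as a small scoring network that assigns each modality a reliability weight at every position, with the fused representation as the softmax-weighted sum. This is a minimal sketch under the same PyTorch assumptions; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Minimal sketch of a dynamic modality-gating layer.

    Given token-aligned features from the lip, audio, and subtitle
    streams, a scorer estimates one reliability weight per modality
    and position; the output is the softmax-weighted combination.
    """

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # Gate scores are conditioned on the concatenation of all
        # modality features at each aligned position.
        self.scorer = nn.Sequential(
            nn.Linear(dim * num_modalities, dim),
            nn.Tanh(),
            nn.Linear(dim, num_modalities),
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (batch, seq_len, dim) tensors, one per
        # modality, assumed already synchronized by the aligner.
        stacked = torch.stack(feats, dim=2)        # (B, T, M, D)
        concat = torch.cat(feats, dim=-1)          # (B, T, M*D)
        weights = self.scorer(concat).softmax(-1)  # (B, T, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)  # (B, T, D)
```

In a sketch like this, the softmax lets the gate down-weight, say, the lip stream when the face is occluded or the audio stream under noise, which corresponds to the robustness behaviour the abstract claims.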
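For the contrastive pretraining stage, a triplet objective over matched and mismatched modality pairs might look as follows. The use of cosine distance, mean-pooled clip embeddings, and the margin value are assumptions for illustration; the abstract specifies only that triplet loss is applied over positive and negative modality pairs.

```python
import torch
import torch.nn.functional as F

def modality_triplet_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Triplet loss over pooled modality embeddings (illustrative).

    anchor and positive come from temporally matched clips in two
    different modalities (e.g., lip video and its own audio);
    negative is the positive's modality drawn from a different clip.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    d_pos = 1.0 - (a * p).sum(-1)  # cosine distance to the match
    d_neg = 1.0 - (a * n).sum(-1)  # cosine distance to the mismatch
    return F.relu(d_pos - d_neg + margin).mean()
```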
- Copyright
- © 2025 The Author(s)
- Open Access
- This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
- Cite this article
TY  - CONF
AU  - Jianing Li
PY  - 2025
DA  - 2025/12/29
TI  - Multimodal Translation with Lip Movements, Audio, and Subtitles
BT  - Proceedings of the 5th International Conference on New Media Development and Modernised Education (NMDME 2025)
PB  - Atlantis Press
SP  - 197
EP  - 205
SN  - 2352-5398
UR  - https://doi.org/10.2991/978-2-38476-523-2_21
DO  - 10.2991/978-2-38476-523-2_21
ID  - Li2025
ER  -