Proceedings of the 5th International Conference on New Media Development and Modernised Education (NMDME 2025)

Multimodal Translation with Lip Movements, Audio, and Subtitles

Authors
Jianing Li1, *
1Guangxi University, Nanning, Guangxi, 530004, China
*Corresponding author. Email: 2405391011@st.gxu.edu.com
Available Online 29 December 2025.
DOI
10.2991/978-2-38476-523-2_21
Keywords
Multimodal Translation; Visual Speech Recognition; Cross-modal Alignment
Abstract

Multimodal translation has emerged as a promising approach to enhance the quality of language translation by incorporating non-textual information such as visual and auditory cues. However, existing methods often rely on limited modality pairs (e.g., speech-text or image-text) and overlook the rich, complementary information embedded in real-world video content. This paper presents a novel tri-modal translation approach that jointly leverages lip movements (visual modality), raw speech audio (auditory modality), and subtitle text (linguistic modality) to improve video-to-text translation performance. To address temporal misalignment across modalities, a cross-modal alignment transformer is introduced to synchronize token-level features. A dynamic modality-gating mechanism is proposed to adaptively weight each modality based on contextual reliability, enabling robust translation even in the presence of noise or occlusion. Furthermore, a prompt-conditioned decoder is designed to guide the generation process using structured natural language instructions derived from all modalities. To enhance semantic alignment prior to fine-tuning, contrastive multimodal pretraining is applied using triplet loss over positive and negative modality pairs. Experiments on publicly available video translation benchmarks demonstrate that the proposed system significantly outperforms strong baselines in terms of BLEU, METEOR, and human evaluation. The results suggest that integrating visual, auditory, and linguistic signals in a unified, prompt-aware architecture offers a powerful strategy for real-world multimodal translation.
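To make the abstract's two central mechanisms concrete, the following is a minimal sketch (not the authors' released code) of a dynamic modality-gating layer and of contrastive pretraining with a triplet loss over modality pairs. Feature dimensions, module names, and the gating MLP are illustrative assumptions.

```python
# Hypothetical sketch of modality gating and triplet-based contrastive
# pretraining, as described in the abstract; details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityGate(nn.Module):
    """Adaptively weights lip, audio, and subtitle features per time step."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # A small scorer estimates the contextual reliability of each modality.
        self.scorer = nn.Linear(dim, 1)
        self.num_modalities = num_modalities

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, seq_len, dim), assumed already
        # time-aligned by the cross-modal alignment transformer.
        scores = self.scorer(feats).squeeze(-1)           # (B, M, T)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)  # softmax over modalities
        return (weights * feats).sum(dim=1)               # (B, T, dim) fused features


def contrastive_pretraining_loss(anchor, positive, negative, margin: float = 0.2):
    """Triplet loss over pooled modality embeddings (e.g. a lip-movement clip,
    its matching subtitle, and a mismatched subtitle), used before fine-tuning
    to pull paired modalities together in a shared space."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)


if __name__ == "__main__":
    B, M, T, D = 2, 3, 50, 512
    gate = ModalityGate(dim=D, num_modalities=M)
    fused = gate(torch.randn(B, M, T, D))
    print(fused.shape)  # torch.Size([2, 50, 512])

    a, p, n = (torch.randn(B, D) for _ in range(3))
    print(contrastive_pretraining_loss(a, p, n).item())
```

Under noise or occlusion the softmax weights let the model lean on whichever modality remains reliable; the margin value and pooling strategy here are placeholders.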

Copyright
© 2025 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


Volume Title
Proceedings of the 5th International Conference on New Media Development and Modernised Education (NMDME 2025)
Series
Advances in Social Science, Education and Humanities Research
Publication Date
29 December 2025
ISBN
978-2-38476-523-2
ISSN
2352-5398
DOI
10.2991/978-2-38476-523-2_21

Cite this article

TY  - CONF
AU  - Jianing Li
PY  - 2025
DA  - 2025/12/29
TI  - Multimodal Translation with Lip Movements, Audio, and Subtitles
BT  - Proceedings of the 5th International Conference on New Media Development and Modernised Education (NMDME 2025)
PB  - Atlantis Press
SP  - 197
EP  - 205
SN  - 2352-5398
UR  - https://doi.org/10.2991/978-2-38476-523-2_21
DO  - 10.2991/978-2-38476-523-2_21
ID  - Li2025
ER  -