Multimodal Question Answering: Method Evolution, Challenges and Prospects
- DOI
- 10.2991/978-94-6239-648-7_38
- Keywords
- Multimodal Question Answering; Visual Question Answering; Cross-modal Fusion; Pre-trained Model
- Abstract
With breakthroughs in cross-modal artificial intelligence, multimodal question answering (MMQA), a key research direction connecting image, text and voice information, has increasingly significant application value in fields such as accessibility services and education. This paper systematically reviews MMQA along three dimensions: task classification, core model methods and experimental performance, focusing on the technical paths of typical tasks such as visual question answering, voice VQA and image-to-voice QA. It also summarizes the innovative mechanisms and practical effects of pre-trained models such as BLIP-2 and GIT in cross-modal representation and semantic understanding. By comparing the dataset adaptability and evaluation results of different tasks, the study reveals bottlenecks of current technology in semantic alignment quality and output naturalness. This analysis provides theoretical and practical references for technical optimization and scenario expansion in the MMQA field. Future research can focus on improving deep cross-modal semantic alignment and the naturalness of generated output, while exploring more general and adaptable frameworks to promote the in-depth application and innovative development of MMQA technology in complex real-world scenarios.
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Haopeng Li
PY  - 2026
DA  - 2026/04/24
TI  - Multimodal Question Answering: Method Evolution, Challenges and Prospects
BT  - Proceedings of the International Workshop on Advances in Deep Learning for Image Analysis and Computer Vision (IWADIC 2025)
PB  - Atlantis Press
SP  - 347
EP  - 354
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6239-648-7_38
DO  - 10.2991/978-94-6239-648-7_38
ID  - Li2026
ER  -