Multimodal Question Answering: Method Evolution, Challenges and Prospects
- DOI
- 10.2991/978-94-6239-648-7_38
- Keywords
- Multimodal Question Answering; Visual Question Answering; Cross-modal Fusion; Pre-trained Model
- Abstract
With breakthroughs in cross-modal artificial intelligence, multimodal question answering (MMQA), a key research direction connecting image, text and voice information, has increasingly significant application value in fields such as accessibility services and education. This paper systematically reviews MMQA along three dimensions: task classification, core model methods and experimental performance, focusing on the technical paths of typical tasks such as visual question answering, voice VQA and image-to-voice QA. It also summarizes the innovative mechanisms and practical effects of pre-trained models such as BLIP-2 and GIT in cross-modal representation and semantic understanding. By comparing the dataset adaptability and evaluation results of different tasks, the study reveals bottlenecks of current technology in semantic alignment quality and output naturalness. This analysis provides theoretical and practical references for technical optimization and scenario expansion in the MMQA field. Future research can focus on improving deep cross-modal semantic alignment and the naturalness of generated output, while exploring more general and adaptable frameworks to promote the in-depth application and innovative development of MMQA technology in complex real-world scenarios.
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Haopeng Li
PY  - 2026
DA  - 2026/04/24
TI  - Multimodal Question Answering: Method Evolution, Challenges and Prospects
BT  - Proceedings of the International Workshop on Advances in Deep Learning for Image Analysis and Computer Vision (IWADIC 2025)
PB  - Atlantis Press
SP  - 347
EP  - 354
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6239-648-7_38
DO  - 10.2991/978-94-6239-648-7_38
ID  - Li2026
ER  -