Vision-to-Voice: Transforming Comics into Immersive Audiobooks with BLIP and OCR

Paron Asib; V. Harish Rosan; V. Ram Prashanth; R. P. Saurjyesh Karthikeyan; T. Anusha

doi:10.2991/978-94-6463-866-0_24

Authors

Paron Asib¹, V. Harish Rosan¹, V. Ram Prashanth¹, R. P. Saurjyesh Karthikeyan¹, T. Anusha¹,*

¹Department of Computer Science and Engineering, College of Engineering and Technology, SRM Institute of Science and Technology, Vadapalani Campus, No. 1, Jawaharlal Nehru Road, Vadapalani, Chennai, Tamil Nadu, India

*Corresponding author. Email: anushat@srmist.edu.in

Corresponding Author

T. Anusha

Available Online 31 October 2025.

DOI: 10.2991/978-94-6463-866-0_24
Keywords: Bootstrapped Language-Image Pretraining (BLIP); Feature Extraction; Optical Character Recognition (OCR); Text-To-Speech (TTS); Natural Language Processing (NLP); Google Text-to-Speech (gTTS)
Abstract: The Vision-to-Voice system transforms comic book stories into engaging audiobooks. It uses Bootstrapped Language-Image Pretraining (BLIP) for visual feature extraction and Optical Character Recognition (OCR) for text extraction. The main visual elements and the text in each comic panel are detected, converted into English text, and then rendered as natural speech with Google Text-to-Speech (gTTS). Trained on large datasets such as MS COCO and evaluated with BLEU, METEOR, and CIDEr, the model produces good-quality captions and scene narration. The method improves accessibility: users can follow comics while travelling, in low-light settings, or when they want to avoid visual fatigue. Importantly, it gives visually impaired users an auditory route into comics, closing the gap between visual and verbal narrative. By transforming printed comics into immersive audio experiences, the system promotes digital accessibility in education and entertainment and lays the groundwork for future advances in multimedia access and the appreciation of visual narratives.
Copyright: © 2025 The Author(s)
Open Access: Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
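
The pipeline summarized in the abstract (caption each panel, read its text, then synthesize speech) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the narration template, the `Salesforce/blip-image-captioning-base` checkpoint, and the use of Tesseract (via `pytesseract`) as the OCR engine are all assumptions made for the example. The abstract names only BLIP, OCR, and gTTS.

```python
from typing import Callable, List


def clean_ocr_text(raw: str) -> str:
    """Collapse the line breaks and repeated spaces that OCR output tends to contain."""
    return " ".join(raw.split())


def build_script(
    panel_paths: List[str],
    caption_fn: Callable[[str], str],
    ocr_fn: Callable[[str], str],
) -> str:
    """Combine a scene caption and the recognized text of each panel into one narration.

    The sentence template below is hypothetical; the paper does not specify one.
    """
    lines = []
    for number, path in enumerate(panel_paths, start=1):
        caption = caption_fn(path).strip().rstrip(".")
        dialogue = clean_ocr_text(ocr_fn(path))
        line = f"Panel {number}: {caption}."
        if dialogue:
            line += f' The text reads: "{dialogue}"'
        lines.append(line)
    return "\n".join(lines)


def blip_caption(image_path: str) -> str:
    """Caption one panel with a pretrained BLIP model (checkpoint name is an assumption).

    The model is reloaded on every call to keep the sketch short; a real
    pipeline would load it once and reuse it.
    """
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    name = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def ocr_text(image_path: str) -> str:
    """Read panel text with Tesseract (assumed engine; the abstract only says 'OCR')."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def speak(text: str, out_path: str) -> None:
    """Synthesize English speech with gTTS and save it as an MP3 file."""
    from gtts import gTTS

    gTTS(text=text, lang="en").save(out_path)
```

With the third-party packages installed, a call such as `speak(build_script(paths, blip_caption, ocr_text), "comic.mp3")` would produce an audio file for a list of panel images. Passing the captioning and OCR stages as arguments keeps the script-building step independent of any particular model, so either stage can be swapped or stubbed out.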

Volume Title: Proceedings of the International Conference on Intelligent Systems and Digital Transformation (ICISD 2025)
Series: Atlantis Highlights in Intelligent Systems
Publication Date: 31 October 2025
ISBN: 978-94-6463-866-0
ISSN: 2589-4919

Cite this article

TY  - CONF
AU  - Paron Asib
AU  - V. Harish Rosan
AU  - V. Ram Prashanth
AU  - R. P. Saurjyesh Karthikeyan
AU  - T. Anusha
PY  - 2025
DA  - 2025/10/31
TI  - Vision-to-Voice: Transforming Comics into Immersive Audiobooks with BLIP and OCR
BT  - Proceedings of the International Conference on Intelligent Systems and Digital Transformation (ICISD 2025)
PB  - Atlantis Press
SP  - 273
EP  - 285
SN  - 2589-4919
UR  - https://doi.org/10.2991/978-94-6463-866-0_24
DO  - 10.2991/978-94-6463-866-0_24
ID  - Asib2025
ER  -