From Pixels to Words: ResNet–LSTM Based Image Captioning with Greedy and Beam Search
- DOI
- 10.2991/978-94-6463-978-0_27
- Keywords
- Image Captioning; Encoder–Decoder; Convolutional Neural Networks; Long Short-Term Memory (LSTM); Beam Search; Transfer Learning
- Abstract
Image captioning bridges visual understanding and natural language generation, enabling systems to produce textual descriptions of images. This capability supports a range of applications, including assistive technology, content retrieval, and human-computer interaction. Progress has been propelled by datasets such as Flickr8k, Flickr30k, and Visual Genome, with MS COCO now standing as the de facto benchmark. In this study we built a CNN-LSTM model using ResNet-50 as the encoder and an LSTM as the decoder. The network was trained with a cross-entropy loss, optimized with Adam, and kept stable through teacher forcing and gradient clipping. During inference we compared greedy decoding against beam search with a beam size of five. Evaluation with the standard COCO metrics (BLEU, METEOR, ROUGE-L, CIDEr) shows that greedy decoding favors unigram precision, whereas beam search improves higher-order n-gram matches, albeit at the expense of consensus-based scores.
A closer look at the results shows the model's strength in identifying objects accurately, yet its captions still lack fluency and diversity. This duality not only reaffirms CNN-LSTM frameworks as baseline contenders but also highlights the need for attention mechanisms, reinforcement-learning strategies, and transformer-based designs if richer semantic grounding and more natural language generation are to be achieved. The source code and model weights are available at: https://github.com/muralikrishnasn/NaturalImageCaption
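The greedy-versus-beam-search comparison in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `next_token_logprobs` table is a toy stand-in for the LSTM decoder's softmax over the vocabulary (a real decoder would condition on the ResNet-50 image features and the full token history), and the beam size defaults to five as in the paper.

```python
import math

def next_token_logprobs(prefix):
    """Toy next-token log-probabilities, standing in for the LSTM decoder."""
    table = {
        (): {"a": math.log(0.6), "the": math.log(0.4)},
        ("a",): {"dog": math.log(0.5), "cat": math.log(0.5)},
        ("the",): {"dog": math.log(0.9), "cat": math.log(0.1)},
        ("a", "dog"): {"<eos>": 0.0}, ("a", "cat"): {"<eos>": 0.0},
        ("the", "dog"): {"<eos>": 0.0}, ("the", "cat"): {"<eos>": 0.0},
    }
    return table[tuple(prefix)]

def greedy_decode(max_len=10):
    """Pick the single most probable token at every step."""
    seq = []
    for _ in range(max_len):
        probs = next_token_logprobs(seq)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

def beam_search(beam_size=5, max_len=10):
    """Keep the top-k partial captions by cumulative log-probability."""
    beams = [([], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []                # sequences that emitted <eos>
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_token_logprobs(seq).items():
                if tok == "<eos>":
                    finished.append((seq, score + lp))
                else:
                    candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    if not finished:
        finished = beams
    return max(finished, key=lambda c: c[1])[0]

print(greedy_decode())   # ['a', 'dog']   — locally best first word
print(beam_search())     # ['the', 'dog'] — higher total probability
```

The toy distribution is chosen so the two strategies diverge: greedy decoding commits to "a" (0.6) and ends with total probability 0.3, while beam search keeps "the" alive and finds "the dog" at 0.36, mirroring how beam search can trade local precision for better whole-sequence matches.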
- Copyright
- © 2025 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - V. S. Shrishma Rao
AU  - S. N. Muralikrishna
AU  - Poornima Shetty
AU  - Aruna Doreen Manezes
PY  - 2025
DA  - 2025/12/31
TI  - From Pixels to Words: ResNet–LSTM Based Image Captioning with Greedy and Beam Search
BT  - Proceedings of the 1st Engineering Data Analytics and Management Conference (EAMCON 2025)
PB  - Atlantis Press
SP  - 293
EP  - 304
SN  - 2352-5401
UR  - https://doi.org/10.2991/978-94-6463-978-0_27
DO  - 10.2991/978-94-6463-978-0_27
ID  - Rao2025
ER  -