Incorporating Attention Mechanism and Word Embeddings for Generating Image Captions

Desiree D'Mello; Garima Bajwa

doi:10.15353/jcvis.v9i1.10009

Vol. 9 No. 1 (2023)
Special Issue: Proceedings of CVIS 2023

Published 2024-04-30

How to Cite

D'Mello, D., & Bajwa, G. (2024). Incorporating Attention Mechanism and Word Embeddings for Generating Image Captions. Journal of Computational Vision and Imaging Systems, 9(1), 34–37. https://doi.org/10.15353/jcvis.v9i1.10009

Abstract

Image captioning, which combines computer vision and natural language processing (NLP), aims to interpret an image automatically by generating a meaningful caption without human intervention. This work examines how a soft-attention mechanism enables a model to generate descriptive captions. We compare two approaches, word embeddings trained from scratch and pre-trained GloVe word embeddings, to determine whether pre-trained vector representations yield more meaningful and correct captions than representations trained from scratch. We use visualization to show how the attention model concentrates on critical elements of the image while producing the corresponding words in the output sequence, and we present the captions generated with each type of embedding. Evaluation using standard BLEU metrics demonstrates that our technique significantly enhances model performance.
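
To make the soft-attention idea in the abstract concrete, the following is a minimal sketch in plain Python and is not the authors' implementation. It assumes the image has already been encoded as a list of region feature vectors and that a relevance score for each region is available; in a full captioning model those scores would typically come from a small learned network conditioned on the decoder state, which is omitted here. The function names `softmax` and `soft_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Illustrative soft attention over image regions.

    features: list of L region vectors, each a list of D floats
    scores:   list of L raw relevance scores (assumed to be given)
    Returns the context vector (length D) and the attention weights (length L).
    """
    weights = softmax(scores)
    dim = len(features[0])
    # The context vector is the attention-weighted average of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights
```

Because every region receives a non-zero weight, the whole operation stays differentiable, and the weights themselves can be plotted over the image to show which regions the model attended to for each generated word.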
