The main goal of image captioning, a task at the intersection of computer vision and NLP, is to automatically generate meaningful captions that describe an image, without human intervention. This work provides insight into utilizing a soft-attention mechanism that enables a model to learn to generate descriptive captions automatically. We considered two approaches, word embeddings trained from scratch and pre-trained GloVe word embeddings, to determine whether pre-trained vector representations yield more meaningful and accurate captions than representations trained from scratch. This study used visualization to demonstrate how the attention model concentrates on salient regions of the image while generating the corresponding words of the output sequence. We also visually compare the captions produced with embeddings trained from scratch against those produced with pre-trained GloVe embeddings. Evaluation using standard BLEU metrics demonstrates that our technique significantly improves model performance.
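For concreteness, the sketch below illustrates the soft-attention step referenced above in minimal NumPy: at each decoding step, additive scores over image regions are normalized with a softmax and used to form a context vector. All names (`features`, `W_f`, `W_h`, `w`) and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w):
    """Bahdanau-style soft attention over image regions.

    features: (L, D) array of L region feature vectors (e.g. a CNN feature map).
    hidden:   (H,)  decoder hidden state at the current time step.
    Returns the context vector (D,) and attention weights alpha (L,).
    """
    # Additive score for each region: e_i = w^T tanh(W_f f_i + W_h h)
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w   # (L,)
    # Softmax turns scores into a distribution over regions
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: expected region feature under the attention distribution;
    # alpha is what gets visualized as a heatmap over the image
    context = alpha @ features                            # (D,)
    return context, alpha

# Toy example: 49 regions (7x7 feature map) of 512-d features, 256-d hidden state
rng = np.random.default_rng(0)
L, D, H, A = 49, 512, 256, 128
features = rng.normal(size=(L, D))
hidden = rng.normal(size=(H,))
W_f = rng.normal(size=(D, A)) * 0.01
W_h = rng.normal(size=(H, A)) * 0.01
w = rng.normal(size=(A,))
context, alpha = soft_attention(features, hidden, W_f, W_h, w)
print(context.shape, alpha.shape, alpha.sum())  # (512,) (49,) 1.0
```

Because the weights `alpha` sum to one, they can be reshaped to the feature-map grid and overlaid on the input image, which is how the attention visualizations described above are typically produced.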