The automatic prediction of image captions is a challenging task in natural language processing (NLP). Many studies have employed convolutional neural networks as encoders and decoders. To predict captions accurately, however, a model must understand the semantic relationships among the many objects present in an image. An attention mechanism computes a linear combination of encoder and decoder states, placing equal emphasis on the semantic information in the caption and the visual information in the image. In this paper, we integrate a local attention approach with two pre-trained convolutional neural networks (CNNs), VGG19 and Inception_V3, to generate a textual description of a given image. These models serve as encoders, while a recurrent neural network (RNN) serves as the decoder. Together with the attention mechanism, the encoders pass semantic-context knowledge to the decoder, and the system achieves a BLEU score of 64.7.
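The idea of local attention as a weighted linear combination of encoder states can be illustrated with a minimal sketch. It is not the implementation used in this paper: the dot-product scoring, the fixed window given by `center` and `half_width`, and the toy region features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(encoder_states, decoder_state, center, half_width):
    """Return (weights, context) using only a window of encoder states.

    encoder_states: list of feature vectors, one per image region
    decoder_state:  current decoder hidden state (same length as a feature)
    center, half_width: hypothetical window parameters; only regions in
        [center - half_width, center + half_width] receive attention
    """
    lo = max(0, center - half_width)
    hi = min(len(encoder_states), center + half_width + 1)
    window = encoder_states[lo:hi]

    # Assumed scoring function: dot product of region feature and decoder state.
    scores = [sum(h * s for h, s in zip(vec, decoder_state)) for vec in window]
    local_weights = softmax(scores)

    # Regions outside the window keep zero weight.
    weights = [0.0] * len(encoder_states)
    for offset, w in enumerate(local_weights):
        weights[lo + offset] = w

    # Context vector: weighted linear combination of the region features.
    dim = len(decoder_state)
    context = [
        sum(weights[i] * encoder_states[i][d] for i in range(len(encoder_states)))
        for d in range(dim)
    ]
    return weights, context


# Toy example: four image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights, context = local_attention(regions, [1.0, 0.0], center=1, half_width=1)
```

In a captioning model, the context vector would be fed to the RNN decoder at each step together with the previously generated word. In this sketch the fourth region lies outside the window, so it contributes nothing to the context.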
