Image captioning is an active problem in computer vision research, with the central aim of generating accurate descriptions of an input image's content. State-of-the-art automatic image captioning systems typically rely on the Encoder-Decoder architecture, and only a few of them exploit features produced by the object detection task to improve the accuracy of caption generation. In this work we introduce a new attention-guided, Encoder-Decoder-based captioning approach that utilizes two types of features: (a) deep visual features extracted from an EfficientNetV2 model pre-trained on the ImageNet dataset, and (b) object features extracted from a YOLOv7 model pre-trained on the MSCOCO dataset. Additionally, we compute a new object-feature-driven quantity, the Priority Factor, which ranks objects by their prominence in the input image. The proposed approach is evaluated on the widely used MSCOCO dataset. Its empirical performance is measured with eight metrics, and the results demonstrate the effectiveness of adding our Priority Factor schema to the object features, yielding minor improvements in the BLEU-1, BLEU-2, BLEU-3, and BLEU-4 evaluation metrics. The approach outperforms four state-of-the-art approaches on several evaluation metrics, including BLEU-1, BLEU-2, BLEU-3, and SPICE.
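The abstract does not define the Priority Factor precisely, but it describes it as a ranking of detected objects by their prominence in the image. As a minimal illustrative sketch, one plausible formulation (an assumption, not the paper's actual formula) combines an object's relative bounding-box size with its detection confidence:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    """One detection from an object detector such as YOLOv7."""
    label: str
    confidence: float   # detector confidence in [0, 1]
    box_area: float     # bounding-box area in pixels
    image_area: float   # total image area in pixels

def priority_factor(obj: DetectedObject) -> float:
    """Hypothetical prominence score: relative box size weighted by confidence.

    This is an illustrative assumption; the paper's exact definition
    of the Priority Factor is not given in the abstract.
    """
    return (obj.box_area / obj.image_area) * obj.confidence

def rank_objects(objects: list[DetectedObject]) -> list[DetectedObject]:
    """Sort detections by descending priority factor."""
    return sorted(objects, key=priority_factor, reverse=True)

# Example: a large, confidently detected dog ranks above a small person.
detections = [
    DetectedObject("person", confidence=0.9, box_area=5_000, image_area=100_000),
    DetectedObject("dog", confidence=0.8, box_area=40_000, image_area=100_000),
]
ranked = rank_objects(detections)
```

Here the dog's score is 0.4 × 0.8 = 0.32 versus the person's 0.05 × 0.9 = 0.045, so the dog is ranked first. Such a ranking could then bias the decoder's attention toward the most prominent objects when generating the caption.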
