(Received: 26-Oct.-2020, Revised: 15-Dec.-2020 and 17-Jan.-2021, Accepted: 2-Feb.-2021)
Deep Learning (DL) algorithms have become widespread and attractive in recent years because of their encouraging achievements in many areas. One important application of DL is image-based object detection and instance segmentation, which remains a critical issue that needs further investigation. This paper studies the fundamental challenges of object instance segmentation in images and proposes a novel three-stage algorithm for multi-object image instance segmentation. In the first stage, a novel backbone improves image recognition by extracting low-level and high-level features from the given images; ResNet is the fundamental building block, and a Squeeze-and-Excitation Network (SENet) is connected to each ResNet block. In the second stage, a Region Proposal Network (RPN) determines the locations of the objects. The third stage proposes an average position RoI layer to choose the optimal boundaries of the instance segmentation. The experiments are conducted and validated on the standard COCO benchmark image dataset. The proposed algorithm’s performance is assessed using standard evaluation criteria and compared against recent instance segmentation algorithms. The results show that the proposed algorithm outperforms other well-known instance segmentation algorithms in terms of average precision (AP) at various Intersection over Union (IoU) thresholds.
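The first-stage idea of attaching a Squeeze-and-Excitation step to each ResNet block can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `se_recalibrate` and `se_residual_block`, the random weights, the ReLU stand-in for the residual branch, and the reduction ratio of 16 are all assumptions made for illustration. The sketch only shows the general mechanism, in which per-channel gates computed from globally pooled features rescale the residual branch before it is added back to the block input.

```python
import numpy as np


def se_recalibrate(features, w1, w2):
    """Rescale each channel of a (C, H, W) feature map with a learned gate.

    w1 has shape (C // r, C) and w2 has shape (C, C // r), where r is the
    reduction ratio (hypothetical weights, not taken from the paper).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))              # shape (C,)
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = np.maximum(w1 @ squeezed, 0.0)            # shape (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))       # shape (C,), in (0, 1)
    # Scale: broadcast the gates over the spatial dimensions.
    return features * gates[:, None, None], gates


def se_residual_block(x, branch, w1, w2):
    """Apply SE recalibration to the residual branch, then add the shortcut."""
    scaled, _ = se_recalibrate(branch(x), w1, w2)
    return x + scaled


# Illustrative usage with random data; all values below are assumptions.
rng = np.random.default_rng(0)
channels, reduction = 64, 16
x = rng.standard_normal((channels, 8, 8))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1
# A plain ReLU stands in for the convolutional layers of a real ResNet block.
out = se_residual_block(x, lambda t: np.maximum(t, 0.0), w1, w2)
```

In a trained network, `w1` and `w2` would be learned jointly with the convolutional weights, so the gates would emphasize informative channels rather than reflect random values as they do here.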
