Vol. 6 No. 1 (2020): Special Issue: Proceedings of CVIS 2020

2D Positional Embedding-based Transformer for Scene Text Recognition

DOI
https://doi.org/10.15353/jcvis.v6i1.3533
Submitted
January 15, 2021
Published
2021-01-15

Abstract

Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs). However, these methods require one-dimensional (1D) features and are not designed for recognizing irregular text instances, because converting to 1D features loses spatial information present in the original two-dimensional (2D) images. In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method uses a 2D positional encoder with the Transformer architecture to preserve the spatial information of 2D image features better than previous methods. Experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperforms state-of-the-art methods in most cases, especially on irregular-text recognition.
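
To make the idea of a 2D positional encoding concrete, the following is a minimal sketch under stated assumptions, not the formulation used in the paper (the abstract does not specify it). It assumes a common sinusoidal variant in which the first half of the channels encodes the row index and the second half encodes the column index. The function names `sinusoid_1d`, `positional_encoding_2d`, and `add_positions` are hypothetical and chosen for illustration only.

```python
import math


def sinusoid_1d(position, dim):
    """Standard 1D sinusoidal encoding of one position into `dim` values.

    `dim` must be even: values alternate sin/cos at geometrically spaced
    frequencies.
    """
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def positional_encoding_2d(height, width, d_model):
    """Return a height x width grid of d_model-length position vectors.

    Assumption (not taken from the paper): the first d_model/2 channels
    encode the row index, the remaining d_model/2 encode the column index.
    """
    if d_model % 4 != 0:
        raise ValueError("d_model must be divisible by 4")
    half = d_model // 2
    return [
        [sinusoid_1d(row, half) + sinusoid_1d(col, half) for col in range(width)]
        for row in range(height)
    ]


def add_positions(features, encoding):
    """Element-wise sum of a feature map and an encoding, both shaped [H][W][C]."""
    return [
        [[f + p for f, p in zip(fvec, pvec)] for fvec, pvec in zip(frow, prow)]
        for frow, prow in zip(features, encoding)
    ]


if __name__ == "__main__":
    pe = positional_encoding_2d(height=2, width=3, d_model=8)
    print(len(pe), len(pe[0]), len(pe[0][0]))
```

In this sketch, two cells in the same row share the first half of their vector and two cells in the same column share the second half, so the grid can be flattened into a sequence for a Transformer encoder while each element still carries both its row and column position.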