Skip to main navigation menu Skip to main content Skip to site footer

2D Positional Embedding-based Transformer for Scene Text Recognition


Recent state-of-the-art scene text recognition methods are primarily based on Recurrent Neural Networks (RNNs), however, these methods require one-dimensional (1D) features and are not designed for recognizing irregular-text instances due to the loss of spatial information present in the original two-dimensional (2D) images.  In this paper, we leverage a Transformer-based architecture for recognizing both regular and irregular text-in-the-wild images. The proposed method takes advantage of using a 2D positional encoder with the Transformer architecture to better preserve the spatial information of 2D image features than previous methods. The experiments on popular benchmarks, including the challenging COCO-Text dataset, demonstrate that the proposed scene text recognition method outperformed the state-of-the-art in most cases, especially on irregular-text recognition.