OCR-VQA: Visual question answering by reading text in images

Mishra, A and Shekhar, S and Singh, AK and Chakraborty, A (2019) OCR-VQA: Visual question answering by reading text in images. In: 15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019, 20 - 25 September 2019, Sydney, pp. 947-952.

Official URL: https://doi.org/10.1109/ICDAR.2019.00156

Abstract

The problem of answering questions about an image is popularly known as visual question answering (VQA for short). It is a well-established problem in computer vision. However, no current VQA method utilizes the text often present in images. These 'texts in images' provide additional useful cues and facilitate better understanding of the visual content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR. We refer to this problem as OCR-VQA. To facilitate a systematic way of studying this new problem, we introduce a large-scale dataset, namely OCR-VQA-200K. This dataset comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images. We judiciously combine well-established techniques from the OCR and VQA domains to present a novel baseline for OCR-VQA-200K. The experimental results and rigorous analysis demonstrate various challenges present in this dataset, leaving ample scope for future research. We are optimistic that this new task, along with the compiled dataset, will open up many exciting research avenues for both the document image analysis and the VQA communities. © 2019 IEEE.

Item Type: Conference Paper
Publication: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Publisher: IEEE Computer Society
Additional Information: The copyright for this article belongs to IEEE Computer Society.
Keywords: Large dataset; Optical character recognition; Document image analysis; Large-scale dataset; Optical character recognition (OCR); Question Answering; Question-answer pairs; Rigorous analysis; TextVQA; Well-established techniques; Image analysis
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 06 Jan 2023 06:28
Last Modified: 06 Jan 2023 06:28
URI: https://eprints.iisc.ac.in/id/eprint/78809
