Vision Datasets

Optical character recognition (OCR)

Contains 903,069 annotated scene-text words (32 words per image on average) across 28,134 natural images from TextVQA.
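As a quick sanity check, the quoted per-image average can be recomputed from the two counts above. This is only a minimal sketch: the helper name `words_per_image` is hypothetical and does not belong to any dataset tooling.

```python
def words_per_image(total_words: int, total_images: int) -> float:
    """Return the average number of annotated words per image."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    return total_words / total_images


# Counts quoted in the listing: annotated words and natural images.
avg = words_per_image(903_069, 28_134)
print(f"{avg:.1f} words per image")  # prints "32.1 words per image"
```

The result rounds to the 32 words per image stated in the listing.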

The US National Institute of Standards and Technology (NIST) publishes handwriting samples from 3600 writers, comprising more than 800,000 character images.

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms.

The ICDAR2003 dataset targets scene text recognition. It contains 507 natural scene images in total: 258 training images and 249 test images.
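The split sizes quoted for ICDAR2003 can be checked against the stated total with a few lines of arithmetic. The sketch below assumes nothing beyond the two counts in the listing; `split_summary` is a hypothetical helper name used for illustration.

```python
def split_summary(train: int, test: int) -> dict:
    """Summarize a train/test split: total size and the share of each part."""
    total = train + test
    if total == 0:
        raise ValueError("split must contain at least one image")
    return {
        "total": total,
        "train_fraction": train / total,
        "test_fraction": test / total,
    }


# Counts quoted in the listing: 258 training images and 249 test images.
summary = split_summary(258, 249)
print(summary["total"])  # prints 507
print(f"{summary['train_fraction']:.1%} train, {summary['test_fraction']:.1%} test")
```

The two parts add up to the 507 images stated, and the split is close to even.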

ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.

Devanagari Characters

A dataset of handwritten Devanagari characters, composed of 1800 samples from 36 character classes, collected from 25 native writers.

Mathematics Expressions

More than 10,000 expressions, including more than 101 mathematical symbols.

Chinese Characters

A dataset of handwritten Chinese characters containing 909,818 images, corresponding to about 10 news articles.

Arabic Printed Text

Contains a lexicon of 113,284 words, and uses 10 Arabic fonts.

Document database

Contains 941 online handwritten documents by 189 writers, and covers lists, tables, formulas, diagrams and drawings.

IAM On-Line Handwriting

Contains forms of handwritten English text acquired on a whiteboard, and includes more than 1700 entries.

Street View Text

The Street View Text dataset was harvested from Google Street View, and mostly deals with outdoor street-level signs and boards.

Street View House Numbers

Contains 73257 digits from house numbers, taken from Google Street View.

Natural Environment OCR

A dataset that contains 659 real world images with 5238 annotations of text.

Contains 3000 images captured in different environments, including outdoor and indoor scenes under different lighting conditions (clear day, night, strong artificial lights, etc.).

Contains 500 natural images taken with a pocket camera. The indoor images are mainly signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards.

A handwritten-words dataset collected by the MIT Spoken Language Systems Group and published by Stanford.

Contains 74K images of both English and Kannada digits.