ICDAR 2019 HDRC
The objective of ICDAR 2019 HDRC (Historical Document Reading Challenge) Chinese is to recognize and analyze the layout and, finally, to detect and recognize the textlines and characters of a large historical document collection containing more than 20,000 pages, kindly provided by FamilySearch.
FamilySearch-DB is a collection of Chinese manuscripts that were chosen with regard to the complexity of their layout, semantic structure, and fonts. All manuscripts are annotated using Aletheia, an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. The annotations of the manuscripts are available in the PAGE XML format, a sophisticated XML schema that is a component of the PAGE (Page Analysis and Ground truth Elements) Format Framework.
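For illustration, the following is a minimal, non-authoritative Python sketch of how annotated textlines could be pulled out of such a PAGE XML file. The element names (TextLine, Coords, TextEquiv, Unicode) are standard PAGE schema names; the namespace URI differs between schema versions, so it is read from the root element instead of being hard-coded, and the file name in the usage comment is hypothetical.

    import xml.etree.ElementTree as ET

    def extract_textlines(page_xml_path):
        """Yield (line id, polygon points, transcription) for every
        TextLine element in a PAGE XML file."""
        root = ET.parse(page_xml_path).getroot()
        # The root tag looks like '{namespace-uri}PcGts'; recover the namespace.
        ns = root.tag.split('}')[0] + '}' if root.tag.startswith('{') else ''
        for line in root.iter(ns + 'TextLine'):
            coords = line.find(ns + 'Coords')
            points = coords.get('points') if coords is not None else None
            unicode_el = line.find('./{0}TextEquiv/{0}Unicode'.format(ns))
            text = unicode_el.text if unicode_el is not None else ''
            yield line.get('id'), points, text

    # Hypothetical usage:
    # for line_id, points, text in extract_textlines('page_0001.xml'):
    #     print(line_id, points, text)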
We propose three tasks for this competition:
Task 1: Handwritten Character Recognition on extracted textlines
Task 2: Layout Analysis on structured historical document images
Task 3: Complete, integrated textline detection and recognition on a large dataset
ICDAR 2019 HDRC Chinese is organised by the EISLAB Machine Learning group at Luleå University of Technology (LTU).
Registration
To register and receive the training data, send an e-mail to: foteini.liwicki@ltu.se
Evaluation tools
Task 1: Available for registered participants
Task 2: Available on GitHub
Task 3: Available for registered participants
Referencing the Data
In any research publication or communication about performance results (e.g., in blog posts or news articles), the source of the data, i.e., the ICDAR 2019 HDRC-Chinese dataset, must be cited as follows:
Frequently Asked Questions (FAQ)
Q1. For some images, the bounding boxes strictly include only text areas, while for some other images, they also contain a large amount of empty space with no text. Which kind of labeling is more advisable?
A1. Large empty regions inside the boxes will not harm the recognition scores, so your algorithm may be as tight or as loose as you prefer. The annotations were made by human domain experts and, unfortunately, they sometimes differ. In the evaluation of Task 2, for example, background regions within the boxes will not be taken into account; for Task 3, only the correctness of the transcription matters (see the sketch below).
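As a rough, non-authoritative sketch of what transcription-focused scoring typically means (the official evaluation tools distributed to registered participants are the authoritative reference), a character error rate (CER) based on Levenshtein edit distance could be computed as follows:

    def edit_distance(ref, hyp):
        """Levenshtein distance between two character sequences,
        computed row by row with O(len(hyp)) memory."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            curr = [i]
            for j, h in enumerate(hyp, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def character_error_rate(ref, hyp):
        """Edit distance normalised by the reference length."""
        return edit_distance(ref, hyp) / max(len(ref), 1)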
Q2. There are different annotations for the space character. Which one should we use?
A2. During evaluation, we will map all space characters to a single space character (code point 32), so it does not matter which one you report.
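As a minimal sketch of that normalisation (assuming a Python implementation; the actual evaluation tool may differ), every Unicode space separator can be folded to the ASCII space before scoring:

    import unicodedata

    def normalize_spaces(text):
        """Fold every Unicode space separator (category Zs, e.g. U+3000
        IDEOGRAPHIC SPACE, common in Chinese text) to ASCII space (32)."""
        return ''.join(' ' if unicodedata.category(c) == 'Zs' else c
                       for c in text)

    # normalize_spaces('a\u3000b') == 'a b'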