Document Visual Question Answering (DocVQA)
ICDAR 2024 COMPETITION ON RECOGNITION AND VQA ON HANDWRITTEN DOCUMENTS
Members of LTU's Machine Learning group participated in the ICDAR 2024 HWD contest (https://ilocr.iiit.ac.in/icdar_2024_hwd/). The participants were Chang Liu, Tosin Adewumi and Elisa Barney Smith. Their submission placed second in Task C of the competition, "Visual Question Answering on Handwritten Documents", in which the model is given US census images and asked to answer questions about them.
The contest used a dataset known as SP-HWVQA-1.0 for visual question answering on single-page handwritten documents [1]. It is partitioned into a training set and a test set, both drawn from the same 250 document images: the training set contains 750 question-answer pairs and the test set contains 250.
For method selection, we first tried various multimodal LLMs such as LLaVA-1.5-7B [2] and SPHINX-tiny [3], as well as other methods such as HiVT5 [4] and Donut [5], without fine-tuning, to get a basic sense of their capability on this task. We then narrowed the candidates down to LLaVA-1.5, SPHINX-tiny and HiVT5 for fine-tuning runs.
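A minimal sketch of what such a zero-shot probe can look like, assuming the publicly available Hugging Face checkpoint llava-hf/llava-1.5-7b-hf and its standard chat prompt (the file name and question are hypothetical; the team's actual inference setup is not described here):

```python
# Zero-shot VQA probe with LLaVA-1.5-7B on a single census page.
# Assumes the llava-hf checkpoint on Hugging Face; not the exact code
# used in the submission.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("census_page_001.png")          # hypothetical file name
question = "What is the name of the head of the household?"
prompt = f"USER: <image>\n{question} ASSISTANT:"   # LLaVA-1.5 chat format

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = processor.decode(output[0], skip_special_tokens=True)
print(answer.split("ASSISTANT:")[-1].strip())
```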
For fine-tuning, we explored three approaches to splitting the training data into training and validation sets at a 9:1 ratio, to get a sense of which would be most representative of the final test set that was yet to be released (a small sketch of the three splits follows the list). The approaches were:
- All 750 pairs randomized before holding out 75 for the validation set.
- The first 75 (out of 750) pairs held out for validation (no randomization).
- The last 75 pairs held out for validation (no randomization).
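The three splits can be sketched as follows, assuming the 750 question-answer pairs are loaded into a Python list (the name qa_pairs and the seed are illustrative, not the exact code used):

```python
# Sketch of the three 9:1 train/validation splits (675/75) over the
# 750 training question-answer pairs.
import random

def split_random(qa_pairs, val_size=75, seed=42):
    """Approach 1: shuffle everything, then hold out val_size pairs."""
    pairs = qa_pairs.copy()
    random.Random(seed).shuffle(pairs)
    return pairs[val_size:], pairs[:val_size]

def split_first(qa_pairs, val_size=75):
    """Approach 2: the first val_size pairs become validation (no shuffling)."""
    return qa_pairs[val_size:], qa_pairs[:val_size]

def split_last(qa_pairs, val_size=75):
    """Approach 3: the last val_size pairs become validation (no shuffling)."""
    return qa_pairs[:-val_size], qa_pairs[-val_size:]
```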
We considered the first (randomized) approach more reliable, even though its results in trial runs were slightly worse than the other two, because the other two may suffer from selection bias.
After fine-tuning, we found that HiVT5 suffered from severe overfitting, and the fine-tuned SPHINX model somewhat outperformed the fine-tuned LLaVA model, so we settled on the SPHINX model. When the test set was released, we retrained on the full (randomized) training set using our best training protocol before making predictions on the test set.
- [1] SP-HWVQA-1.0 dataset: https://ilocr.iiit.ac.in/icdar_2024_hwd/dataset.html
- [2] LLaVA-1.5: https://arxiv.org/abs/2310.03744
- [3] SPHINX (LLaMA2-Accessory): https://github.com/Alpha-VLLM/LLaMA2-Accessory
- [4] HiVT5: https://arxiv.org/pdf/2212.05935
- [5] Donut: https://github.com/clovaai/donut