Document Question Answering
Document Question Answering (also known as Document Visual Question Answering) is the task of answering questions about document images. Document question answering models take a (document, question) pair as input and return an answer in natural language. Models usually rely on multi-modal features, combining text, the position of words (bounding boxes), and the image.
Question
What is the idea behind the consumer relations efficiency team?
Answer
Balance cost efficiency with quality customer service
About Document Question Answering
Use Cases
Document Question Answering models can be used to answer natural language questions about documents. Typically, document QA models consider textual, layout, and potentially visual information, which is useful when a question requires some understanding of the visual aspects of the document. Nevertheless, certain document QA models can work without document images. Hence the task is not limited to visually rich documents: users can also ask questions about spreadsheets, text PDFs, and similar sources.
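To make the layout information mentioned above more concrete, here is a minimal sketch of how word bounding boxes are often prepared. It assumes pixel boxes in (x0, y0, x1, y1) form coming from some OCR step, and rescales them to a 0-1000 range, which is the convention LayoutLM-style models expect. The helper name `normalize_box` is hypothetical and used only for illustration.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Rescale a pixel bounding box (x0, y0, x1, y1) to a 0-1000 range.

    Assumption: `box` comes from an OCR engine and `width`/`height`
    are the pixel dimensions of the page image.
    """
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A word located at (50, 100)-(150, 200) on a 500x1000 pixel page
print(normalize_box((50, 100, 150, 200), width=500, height=1000))
```

Rescaling makes the positions independent of image resolution, so the same word layout yields the same coordinates whether the page was scanned at low or high DPI.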
Document Parsing
One of the most popular use cases of document question answering models is the parsing of structured documents. For example, you can extract the name, address, and other information from a form. You can also use the model to extract information from a table, or even a resume.
Invoice Information Extraction
Another very popular use case is invoice information extraction. For example, you can extract the invoice number, the invoice date, the total amount, the VAT number, and the invoice recipient.
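As a rough sketch of this use case, the fields listed above can be phrased as one question each and sent to a document QA model in a loop. The question wording, the `INVOICE_QUESTIONS` mapping, and the `extract_fields` helper are all illustrative assumptions; `ask` stands for any callable that takes a question string and returns an answer string.

```python
from typing import Callable, Dict

# Hypothetical question wording for each invoice field
INVOICE_QUESTIONS: Dict[str, str] = {
    "invoice_number": "What is the invoice number?",
    "invoice_date": "What is the invoice date?",
    "total_amount": "What is the total amount?",
    "vat_number": "What is the VAT number?",
    "recipient": "Who is the invoice recipient?",
}


def extract_fields(
    ask: Callable[[str], str],
    questions: Dict[str, str] = INVOICE_QUESTIONS,
) -> Dict[str, str]:
    """Ask one question per field and collect the answers in a dict."""
    return {field: ask(question) for field, question in questions.items()}


# Stand-in for a real model so the sketch runs without downloading anything
def fake_ask(question: str) -> str:
    return f"<answer to: {question}>"


fields = extract_fields(fake_ask)
print(fields["total_amount"])
```

With a Transformers document-question-answering pipeline `pipe` and a loaded `image`, `ask` could be something like `lambda q: pipe(image=image, question=q)[0]["answer"]`.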
Inference
You can run inference with Document QA models with the 🤗 Transformers library using the document-question-answering pipeline. If no model checkpoint is given, the pipeline will be initialized with impira/layoutlm-document-qa. This pipeline takes question(s) and document(s) as input, and returns the answer.
Note that the question answering task solved here is extractive: the model extracts the answer from a context (the document).
from transformers import pipeline
from PIL import Image

# Load a Donut checkpoint fine-tuned on DocVQA
pipe = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")

question = "What is the purchase amount?"
image = Image.open("your-document.png")  # path to your own document image

pipe(image=image, question=question)
## [{'answer': '20,000$'}]
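The pipeline returns a list of dictionaries, as in the output above. A small helper like the following sketch can pick one answer from that list. It assumes that some checkpoints also include a "score" key in each dictionary (the output shown above only has "answer"), and it treats a missing score as fully confident. Both the `best_answer` name and the `min_score` threshold are hypothetical choices for illustration.

```python
from typing import Any, Dict, List, Optional, Union


def best_answer(
    results: Union[Dict[str, Any], List[Dict[str, Any]]],
    min_score: float = 0.5,
) -> Optional[str]:
    """Return the highest-scoring answer, or None if none passes min_score.

    Assumption: each result is a dict with an "answer" key and an
    optional "score" key; results without a score count as 1.0.
    """
    if isinstance(results, dict):
        results = [results]
    candidates = [r for r in results if r.get("score", 1.0) >= min_score]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.get("score", 1.0))["answer"]


print(best_answer([{"answer": "20,000$"}]))
```

Returning None for low-confidence results lets the caller decide whether to fall back to manual review instead of silently accepting a weak answer.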
Useful Resources
Would you like to learn more about Document QA? Awesome! Here are some curated resources that you may find helpful!
- Document Visual Question Answering (DocVQA) challenge
- DocVQA: A Dataset for Document Visual Question Answering (Dataset paper)
- ICDAR 2021 Competition on Document Visual Question Answering (Conference paper)
- Hugging Face's Document Question Answering pipeline
- GitHub repo: DocQuery - Document Query Engine Powered by Large Language Models
Notebooks
- Fine-tuning Donut on DocVQA dataset
- Fine-tuning LayoutLMv2 on DocVQA dataset
- Accelerating Document AI
Documentation
The contents of this page are contributed by Eliott Zemour and reviewed by Kwadwo Agyapon-Ntra and Ankur Goyal.
Note A LayoutLM model for the document QA task, fine-tuned on DocVQA and SQuAD2.0.
Note A special model for OCR-free Document QA task. Donut model fine-tuned on DocVQA.
Note A powerful model for document question answering.
Note Dataset from the 2020 DocVQA challenge. The documents are taken from the UCSF Industry Documents Library.
Note A robust document question answering application.
Note An application that can answer questions from invoices.
Note An application to compare different document question answering models.
- anls
- The evaluation metric for the DocVQA challenge is the Average Normalized Levenshtein Similarity (ANLS). This metric is tolerant of character recognition errors and compares the predicted answer with the ground truth answer.
- exact-match
- Exact Match is a metric based on the strict character match between the predicted answer and the ground truth answer. For answers predicted correctly, the Exact Match will be 1. Even if only one character is different, Exact Match will be 0.
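The two metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the official challenge scorer: it assumes the commonly used ANLS threshold of 0.5, case-insensitive comparison after stripping whitespace, and a list of acceptable ground truth answers per question. The function names are hypothetical.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str], references: Sequence[List[str]], threshold: float = 0.5) -> float:
    """Average, over questions, of the best similarity to any ground truth.

    Similarity is 1 - normalized edit distance, and is set to 0 when the
    normalized distance reaches the threshold (assumed to be 0.5 here).
    """
    scores = []
    for prediction, ground_truths in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for ground_truth in ground_truths:
            gt = ground_truth.strip().lower()
            longest = max(len(pred), len(gt))
            distance = levenshtein(pred, gt) / longest if longest else 0.0
            if distance < threshold:
                best = max(best, 1.0 - distance)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


def exact_match(prediction: str, reference: str) -> int:
    """1 if the strings are identical character for character, else 0."""
    return int(prediction == reference)


# One OCR-style error: "." instead of ","
print(anls(["20.000$"], [["20,000$"]]))
print(exact_match("20.000$", "20,000$"))
```

In the usage lines, the single wrong character costs Exact Match the whole point, while ANLS only loses the fraction corresponding to one edit out of seven characters.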