ERNIE-Layout


1. Model Introduction

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.

The work was accepted to EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual ERNIE-Layout model through PaddleNLP.

2. Out-of-the-Box Usage

HuggingFace web demo

🧾 HuggingFace web demo is available here

Demo show

Invoice VQA

Poster VQA

WebPage VQA

Table VQA

Exam Paper VQA

English invoice VQA with multilingual (CH, EN, JP, TH, ES, RU) prompts

Chinese invoice VQA with multilingual (CHS, CHT, EN, JP, DE) prompts

Demo images are available here

Taskflow

Input Format

[
  {"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]},
  {"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]

PaddleOCR is used by default. You can also supply your own OCR results via word_boxes, whose format is List[str, List[float, float, float, float]].

[
  {"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
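
As a minimal sketch of the word_boxes format described above, the snippet below turns a list of OCR tokens into the input structure. The helper names make_word_boxes and build_input, the file name ./invoice.png, the sample tokens, and the (x1, y1, x2, y2) coordinate order are assumptions made for illustration; the text above only specifies one string plus four floats per entry.

```python
from typing import Dict, List, Sequence, Tuple


def make_word_boxes(tokens: Sequence[Tuple[str, Sequence[float]]]) -> List[list]:
    """Validate (text, box) pairs and convert them to [str, [float x 4]] entries."""
    word_boxes = []
    for text, box in tokens:
        if len(box) != 4:
            raise ValueError(f"expected 4 coordinates for {text!r}, got {len(box)}")
        # Assumed coordinate order: x1, y1, x2, y2 (not confirmed by the README).
        word_boxes.append([text, [float(v) for v in box]])
    return word_boxes


def build_input(doc_path: str, prompts: Sequence[str], tokens) -> List[Dict]:
    """Build the list-of-dicts input with a custom OCR result attached."""
    return [{"doc": doc_path, "prompt": list(prompts), "word_boxes": make_word_boxes(tokens)}]


# Hypothetical OCR output for a hypothetical image.
example = build_input(
    "./invoice.png",
    ["What is the total?"],
    [("Total", (10, 20, 60, 40)), ("42.00", (70, 20, 120, 40))],
)
print(example)
```

The resulting list can be passed to the Taskflow object in place of an input without word_boxes.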

Both single and batch inputs are supported.

Image from an HTTP link

>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence", lang="en")
>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}]))
[{'prompt': "What is the name of the author of 'The Adventure Zone: The "
            'Crystal Kingdom’?',
  'result': [{'end': 39,
              'prob': 0.99,
              'start': 22,
              'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis '
                      'McElroy'}]},
{'prompt': 'What type of book cover does The Adventure Zone: The Crystal '
            'Kingdom have?',
  'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]},
{'prompt': 'For Rage, who is the author listed as?',
  'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}]

Image from a local path

>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> docprompt = Taskflow("document_intelligence")
>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
[{'prompt': '五百丁本次想要担任的是什么职位?',
  'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
{'prompt': '五百丁是在哪里上的大学?',
  'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
{'prompt': '大学学的是什么专业?',
  'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科）'}]}]
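
The outputs above are lists of dictionaries, one per prompt, each with a list of candidate answers. A minimal sketch of post-processing that structure follows; the helper name best_answers is hypothetical, and the sample data is copied from the output shown earlier.

```python
def best_answers(results):
    """Map each prompt to the value of its highest-probability answer (or None)."""
    answers = {}
    for item in results:
        candidates = item.get("result", [])
        if candidates:
            top = max(candidates, key=lambda c: c["prob"])
            answers[item["prompt"]] = top["value"]
        else:
            # No answer span was returned for this prompt.
            answers[item["prompt"]] = None
    return answers


sample = [
    {
        "prompt": "For Rage, who is the author listed as?",
        "result": [{"end": 93, "prob": 1.0, "start": 91, "value": "Bob Woodward"}],
    },
    {
        "prompt": "What type of book cover does The Adventure Zone: The Crystal Kingdom have?",
        "result": [{"end": 51, "prob": 1.0, "start": 51, "value": "Paperback"}],
    },
]
print(best_answers(sample))
```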

Parameter Description
- batch_size: number of inputs per batch; defaults to 1.
- lang: PaddleOCR language; en works better for English images; defaults to ch.
- topn: return the n results with the highest probability; defaults to 1.

3. Model Performance

Dataset

Dataset	Task	Language	Note
FUNSD	Key Information Extraction	English	-
XFUND-ZH	Key Information Extraction	Chinese	-
DocVQA-ZH	Document Question Answering	Chinese	Submissions to the DocVQA-ZH competition are now closed, so we split the original dataset into three parts for model evaluation: 4,187 training images, 500 validation images, and 500 test images.
RVL-CDIP (sampled)	Document Image Classification	English	The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because the original dataset is large and slow to train on, we downsample it. The sampled dataset consists of 6,400 training images, 800 validation images, and 800 test images.

Results

Model	FUNSD	RVL-CDIP (sampled)	XFUND-ZH	DocVQA-ZH
LayoutXLM-Base	86.72	90.88	86.24	66.01
ERNIE-LayoutX-Base	89.31	90.29	88.58	69.57

Evaluation Methods

Hyperparameters for all of the above tasks were selected by grid search. For FUNSD and XFUND-ZH, the model is evaluated every 100 steps and the metric is F1-score. For RVL-CDIP, the evaluation interval is 2000 steps and the metric is accuracy. For DocVQA-ZH, the evaluation interval is 10000 steps and the metric is ANLS.
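
For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), the following is a minimal sketch of the metric as it is commonly defined for DocVQA-style benchmarks. The 0.5 threshold and the case-insensitive comparison are assumptions taken from that common definition; the implementation used by run_mrc.py may differ in details.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best similarity against any reference answer."""
    scores = []
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            # Normalized distance in [0, 1]; two empty strings count as identical.
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["Bob Woodward", "abcd"], [["Bob Woodward"], ["abce"]]))
```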

Hyper Parameters search ranges

Hyper Parameters	FUNSD	RVL-CDIP (sampled)	XFUND-ZH	DocVQA-ZH
learning_rate	5e-6, 1e-5, 2e-5, 5e-5	5e-6, 1e-5, 2e-5, 5e-5	5e-6, 1e-5, 2e-5, 5e-5	5e-6, 1e-5, 2e-5, 5e-5
batch_size	1, 2, 4	8, 16, 24	1, 2, 4	8, 16, 24
warmup_ratio	-	0, 0.05, 0.1	-	0, 0.05, 0.1

lr_scheduler_type is constant for FUNSD and XFUND-ZH, so warmup_ratio is not searched for these two datasets.

max_steps is used for fine-tuning on FUNSD and XFUND-ZH (10000 and 20000 steps respectively); num_train_epochs is set to 6 for DocVQA-ZH and 20 for RVL-CDIP.
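
The search ranges in the table above can be enumerated as a plain grid. The sketch below only lists the configurations; the names SEARCH_SPACE and grid are hypothetical, and how each configuration is launched and scored is left out.

```python
from itertools import product

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]

# Search ranges copied from the table above; warmup_ratio is absent
# where the table shows "-".
SEARCH_SPACE = {
    "funsd": {"learning_rate": LEARNING_RATES, "batch_size": [1, 2, 4]},
    "xfund_zh": {"learning_rate": LEARNING_RATES, "batch_size": [1, 2, 4]},
    "rvl_cdip_sampled": {
        "learning_rate": LEARNING_RATES,
        "batch_size": [8, 16, 24],
        "warmup_ratio": [0, 0.05, 0.1],
    },
    "docvqa_zh": {
        "learning_rate": LEARNING_RATES,
        "batch_size": [8, 16, 24],
        "warmup_ratio": [0, 0.05, 0.1],
    },
}


def grid(space):
    """Return every combination of the given hyperparameter values as a dict."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


for dataset, space in SEARCH_SPACE.items():
    print(dataset, len(grid(space)), "configurations")
```

Under these ranges, FUNSD and XFUND-ZH each have 12 configurations, while RVL-CDIP (sampled) and DocVQA-ZH each have 36.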

Best Hyper Parameter

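Each cell lists learning_rate, batch_size, and warmup_ratio, in that order; _ means the parameter was not searched.

Model	FUNSD	RVL-CDIP (sampled)	XFUND-ZH	DocVQA-ZH
LayoutXLM-Base	1e-5, 2, _	1e-5, 8, 0.1	1e-5, 2, _	2e-5, 8, 0.1
ERNIE-LayoutX-Base	2e-5, 4, _	1e-5, 8, 0.	1e-5, 4, _	2e-5, 8, 0.05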

4. Fine-tuning Examples

Installation

pip install -r requirements.txt

4.1 Key Information Extraction

FUNSD Train

python -u run_ner.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
  --dataset_name funsd \
  --do_train \
  --do_eval \
  --max_steps 10000 \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern ner-bio \
  --preprocessing_num_workers 4 \
  --overwrite_cache false \
  --use_segment_box \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 2e-5 \
  --lr_scheduler_type constant \
  --gradient_accumulation_steps 1 \
  --seed 1000 \
  --metric_for_best_model eval_f1 \
  --greater_is_better true \
  --overwrite_output_dir

XFUND-ZH Train

python -u run_ner.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \
  --dataset_name xfund_zh \
  --do_train \
  --do_eval \
  --lang "ch" \
  --max_steps 20000 \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern ner-bio \
  --preprocessing_num_workers 4 \
  --overwrite_cache false \
  --use_segment_box \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --learning_rate 1e-5 \
  --lr_scheduler_type constant \
  --gradient_accumulation_steps 1 \
  --seed 1000 \
  --metric_for_best_model eval_f1 \
  --greater_is_better true \
  --overwrite_output_dir

4.2 Document Question Answering

DocVQA-ZH Train

python3 -u run_mrc.py \
  --model_name_or_path ernie-layoutx-base-uncased \
  --output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \
  --dataset_name docvqa_zh \
  --do_train \
  --do_eval \
  --lang "ch" \
  --num_train_epochs 6 \
  --lr_scheduler_type linear \
  --warmup_ratio 0.05 \
  --weight_decay 0 \
  --eval_steps 10000 \
  --save_steps 10000 \
  --save_total_limit 1 \
  --load_best_model_at_end \
  --pattern "mrc" \
  --use_segment_box false \
  --return_entity_level_metrics false \
  --overwrite_cache false \
  --doc_stride 128 \
  --target_size 1000 \
  --per_device_train_batch_size 8 \
  --per_device_eval_batch_size 8 \
  --learning_rate 2e-5 \
  --preprocessing_num_workers 32 \
  --train_nshard 16 \
  --seed 1000 \
  --metric_for_best_model anls \
  --greater_is_better true \
  --overwrite_output_dir

4.3 Document Image Classification

RVL-CDIP Train

python3 -u run_cls.py \
    --model_name_or_path ernie-layoutx-base-uncased \
    --output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \
    --dataset_name rvl_cdip_sampled \
    --do_train \
    --do_eval \
    --num_train_epochs 20 \
    --lr_scheduler_type linear \
    --max_seq_length 512 \
    --warmup_ratio 0.05 \
    --weight_decay 0 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --save_total_limit 1 \
    --load_best_model_at_end \
    --pattern "cls" \
    --use_segment_box \
    --return_entity_level_metrics false \
    --overwrite_cache false \
    --doc_stride 128 \
    --target_size 1000 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --preprocessing_num_workers 32 \
    --train_nshard 16 \
    --seed 1000 \
    --metric_for_best_model acc \
    --greater_is_better true \
    --overwrite_output_dir

5. Deploy

5.1 Inference Model Export

After fine-tuning, you can export the inference model with the model export script (export_model.py). The inference model is saved in the output_path you specify.

Export the model fine-tuned on FUNSD

python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export

Export the model fine-tuned on DocVQA-ZH

python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export

Export the model fine-tuned on RVL-CDIP (sampled)

python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export

Parameter Description
- model_path: directory where the dygraph (dynamic graph) model parameters are saved; defaults to "./checkpoint/".
- output_path: directory where the static graph model parameters are saved; defaults to "./export".
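
The three export commands above differ only in task type and paths, so they can be generated from a small table. This is a sketch only: the names EXPORTS and export_command are hypothetical, and the paths are the ones used in the commands above.

```python
import shlex

# task_type -> fine-tuned model directory, as used in the commands above.
EXPORTS = {
    "ner": "./ernie-layoutx-base-uncased/models/funsd/",
    "mrc": "./ernie-layoutx-base-uncased/models/docvqa_zh/",
    "cls": "./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/",
}


def export_command(task_type, model_path, output_path=None):
    """Build the argument list for one export_model.py invocation."""
    output_path = output_path or f"./{task_type}_export"
    return [
        "python", "export_model.py",
        "--task_type", task_type,
        "--model_path", model_path,
        "--output_path", output_path,
    ]


commands = [export_command(task, path) for task, path in EXPORTS.items()]
for cmd in commands:
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

To actually run an export, each list can be passed to subprocess.run(cmd, check=True) from the directory that contains export_model.py.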

5.2 Python Deployment

We provide deployment examples for Key Information Extraction, Document Question Answering, and Document Image Classification. Please follow the ERNIE-Layout Python Deploy Guide.
