Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.
The work was accepted at EMNLP 2022 (Findings). To broaden the commercial applications of document intelligence, we release the multilingual ERNIE-Layout model through PaddleNLP.
🧾 HuggingFace web demo is available here
[
{"doc": "./book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]},
{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}
]
PaddleOCR is used by default. You can also supply your own OCR results via word_boxes; the data format is List[str, List[float, float, float, float]].
[
{"doc": doc_path, "prompt": prompt, "word_boxes": word_boxes}
]
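As a minimal sketch of the word_boxes format described above, the helper below turns OCR output into one input entry. The function name `build_docprompt_input`, the sample words, and the box coordinate order (assumed here to be x_min, y_min, x_max, y_max) are illustrative assumptions, not part of PaddleNLP.

```python
from typing import Dict, List, Sequence, Tuple


def build_docprompt_input(
    doc_path: str,
    prompts: Sequence[str],
    ocr_results: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, object]:
    """Build one input entry that carries user-supplied OCR results.

    Hypothetical helper: each OCR result is (text, box), where box holds
    four numbers. The order (x_min, y_min, x_max, y_max) is an assumption.
    """
    word_boxes: List[list] = []
    for text, box in ocr_results:
        if len(box) != 4:
            raise ValueError(f"expected 4 box coordinates, got {len(box)}")
        # Each item follows the [str, [float, float, float, float]] layout.
        word_boxes.append([str(text), [float(v) for v in box]])
    return {"doc": doc_path, "prompt": list(prompts), "word_boxes": word_boxes}


# Made-up OCR output, for illustration only.
example = build_docprompt_input(
    "./book.png",
    ["For Rage, who is the author listed as?"],
    [("Rage", (10, 20, 60, 40)), ("Bob Woodward", (10, 45, 120, 65))],
)
print(example["word_boxes"][0])
```

The resulting entry is then wrapped in a list, in the same way as the inputs shown above.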
Support single and batch input
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence", lang="en")
>>> pprint(docprompt([{"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/book.png", "prompt": ["What is the name of the author of 'The Adventure Zone: The Crystal Kingdom’?", "What type of book cover does The Adventure Zone: The Crystal Kingdom have?", "For Rage, who is the author listed as?"]}]))
[{'prompt': "What is the name of the author of 'The Adventure Zone: The "
            'Crystal Kingdom’?',
  'result': [{'end': 39,
              'prob': 0.99,
              'start': 22,
              'value': 'Clint McElroy. Carey Pietsch, Griffn McElroy, Travis '
                       'McElroy'}]},
 {'prompt': 'What type of book cover does The Adventure Zone: The Crystal '
            'Kingdom have?',
  'result': [{'end': 51, 'prob': 1.0, 'start': 51, 'value': 'Paperback'}]},
 {'prompt': 'For Rage, who is the author listed as?',
  'result': [{'end': 93, 'prob': 1.0, 'start': 91, 'value': 'Bob Woodward'}]}]
>>> from pprint import pprint
>>> from paddlenlp import Taskflow
>>> docprompt = Taskflow("document_intelligence")
>>> pprint(docprompt([{"doc": "./resume.png", "prompt": ["五百丁本次想要担任的是什么职位?", "五百丁是在哪里上的大学?", "大学学的是什么专业?"]}]))
[{'prompt': '五百丁本次想要担任的是什么职位?',
  'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},
 {'prompt': '五百丁是在哪里上的大学?',
  'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},
 {'prompt': '大学学的是什么专业?',
  'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]
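The returned structure is a list with one entry per prompt, each holding a list of candidate answers with a prob score. A minimal sketch of post-processing it in plain Python follows; the helper name `best_answers` and the `min_prob` threshold are hypothetical and not part of the Taskflow API.

```python
from typing import Dict, List, Optional


def best_answers(outputs: List[dict], min_prob: float = 0.0) -> Dict[str, Optional[str]]:
    """Map each prompt to its highest-probability answer.

    A prompt maps to None when no candidate reaches min_prob.
    """
    answers: Dict[str, Optional[str]] = {}
    for item in outputs:
        candidates = [r for r in item["result"] if r["prob"] >= min_prob]
        best = max(candidates, key=lambda r: r["prob"], default=None)
        answers[item["prompt"]] = best["value"] if best else None
    return answers


# Two entries copied from the resume example output above.
outputs = [
    {"prompt": "五百丁本次想要担任的是什么职位?",
     "result": [{"end": 7, "prob": 1.0, "start": 4, "value": "客户经理"}]},
    {"prompt": "大学学的是什么专业?",
     "result": [{"end": 44, "prob": 0.82, "start": 38, "value": "金融学(本科)"}]},
]
print(best_answers(outputs, min_prob=0.9))
```

With a threshold of 0.9, the second prompt maps to None because its only candidate has a probability of 0.82.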
Parameter Description
batch_size: number of inputs in each batch; defaults to 1.
lang: the PaddleOCR language; en works better for English images; defaults to ch.
topn: return the n results with the highest probability; defaults to 1.
Dataset
Dataset | Task | Language | Note |
---|---|---|---|
FUNSD | Key Information Extraction | English | - |
XFUND-ZH | Key Information Extraction | Chinese | - |
DocVQA-ZH | Document Question Answering | Chinese | Submissions to the DocVQA-ZH competition are now closed, so we split the original dataset into three parts for model evaluation: 4,187 training images, 500 validation images, and 500 test images. |
RVL-CDIP (sampled) | Document Image Classification | English | The RVL-CDIP dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Because the original dataset is large and slow to train on, we downsample it. The sampled dataset consists of 6,400 training images, 800 validation images, and 800 test images. |
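The note for RVL-CDIP does not say how the subset was drawn. If the sampling were class-balanced, 6,400 / 800 / 800 images over 16 classes would correspond to 400 / 50 / 50 images per class. The sketch below illustrates that assumption on toy data; the function name `sample_per_class`, the split names, and the file names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_per_class(
    items: Sequence[Tuple[str, int]],
    per_class: Dict[str, int],
    seed: int = 1000,
) -> Dict[str, List[Tuple[str, int]]]:
    """Draw disjoint, class-balanced splits from (path, label) pairs.

    per_class maps a split name to the number of items taken per label.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append((path, label))

    splits: Dict[str, List[Tuple[str, int]]] = {name: [] for name in per_class}
    for label in sorted(by_label):
        pool = by_label[label][:]
        rng.shuffle(pool)
        start = 0
        for name, count in per_class.items():
            chunk = pool[start:start + count]
            if len(chunk) < count:
                raise ValueError(f"label {label} has too few items")
            splits[name].extend(chunk)
            start += count  # consecutive slices keep the splits disjoint
    return splits


# Toy stand-in for the real dataset: 16 classes with 600 items each.
toy = [(f"img_{label}_{i}.tif", label) for label in range(16) for i in range(600)]
splits = sample_per_class(toy, {"train": 400, "dev": 50, "test": 50})
print({name: len(rows) for name, rows in splits.items()})
```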
Results
Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
---|---|---|---|---|
LayoutXLM-Base | 86.72 | 90.88 | 86.24 | 66.01 |
ERNIE-LayoutX-Base | 89.31 | 90.29 | 88.58 | 69.57 |
Evaluation Methods
All of the above tasks use grid search for hyperparameter tuning. For FUNSD and XFUND-ZH, the model is evaluated every 100 steps and the metric is F1-Score. For RVL-CDIP, it is evaluated every 2000 steps and the metric is Accuracy. For DocVQA-ZH, it is evaluated every 10000 steps and the metric is ANLS.
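For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), here is a minimal sketch based on the definition commonly used for DocVQA: each prediction scores 1 minus its normalized edit distance to the closest reference answer, or 0 when that distance reaches a threshold. The threshold of 0.5 and the lowercasing and whitespace stripping are assumptions, and the evaluation code in this repository may differ in such details.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(
    references: Sequence[Sequence[str]],
    predictions: Sequence[str],
    threshold: float = 0.5,
) -> float:
    """Average over questions of the best thresholded similarity."""
    scores = []
    for answers, pred in zip(references, predictions):
        best = 0.0
        for answer in answers:
            a, p = answer.strip().lower(), pred.strip().lower()
            longest = max(len(a), len(p))
            nl = levenshtein(a, p) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls([["Paperback"], ["Bob Woodward"]], ["Paperbac", "Bob Woodward"]))
```

The threshold keeps predictions that share only a few characters with the reference from earning partial credit.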
Hyper Parameters search ranges
Hyper Parameters | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
---|---|---|---|---|
learning_rate | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 | 5e-6, 1e-5, 2e-5, 5e-5 |
batch_size | 1, 2, 4 | 8, 16, 24 | 1, 2, 4 | 8, 16, 24 |
warmup_ratio | - | 0, 0.05, 0.1 | - | 0, 0.05, 0.1 |
The lr_scheduler_type for FUNSD and XFUND-ZH is constant, so warmup_ratio is excluded.
max_steps is used for fine-tuning on FUNSD and XFUND-ZH (10000 and 20000 steps respectively); num_train_epochs is set to 6 for DocVQA-ZH and 20 for RVL-CDIP.
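The search ranges above can be enumerated as a grid. The sketch below only lists the combinations and does not launch any training; the dictionary layout and the `grid` helper are illustrative assumptions, with dataset keys borrowed from the --dataset_name values used in the fine-tuning commands.

```python
from itertools import product
from typing import Dict, List

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]

# warmup_ratio is omitted for FUNSD and XFUND-ZH (constant scheduler).
SEARCH_SPACE: Dict[str, Dict[str, list]] = {
    "funsd": {"learning_rate": LEARNING_RATES, "batch_size": [1, 2, 4]},
    "xfund_zh": {"learning_rate": LEARNING_RATES, "batch_size": [1, 2, 4]},
    "rvl_cdip_sampled": {
        "learning_rate": LEARNING_RATES,
        "batch_size": [8, 16, 24],
        "warmup_ratio": [0, 0.05, 0.1],
    },
    "docvqa_zh": {
        "learning_rate": LEARNING_RATES,
        "batch_size": [8, 16, 24],
        "warmup_ratio": [0, 0.05, 0.1],
    },
}


def grid(space: Dict[str, list]) -> List[dict]:
    """Return every combination of the listed values as a dict."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


for dataset, space in SEARCH_SPACE.items():
    print(dataset, len(grid(space)))
```

Each dictionary returned by `grid` corresponds to one fine-tuning run, so the search costs 12 runs for each of FUNSD and XFUND-ZH and 36 runs for each of RVL-CDIP (sampled) and DocVQA-ZH.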
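Best Hyper Parameter
Each cell lists learning_rate, batch_size, and warmup_ratio, in that order; _ marks a setting that is not searched for that dataset.
Model | FUNSD | RVL-CDIP (sampled) | XFUND-ZH | DocVQA-ZH |
---|---|---|---|---|
LayoutXLM-Base | 1e-5, 2, _ | 1e-5, 8, 0.1 | 1e-5, 2, _ | 2e-5, 8, 0.1 |
ERNIE-LayoutX-Base | 2e-5, 4, _ | 1e-5, 8, 0. | 1e-5, 4, _ | 2e-5, 8, 0.05 |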
pip install -r requirements.txt
python -u run_ner.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
--dataset_name funsd \
--do_train \
--do_eval \
--max_steps 10000 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern ner-bio \
--preprocessing_num_workers 4 \
--overwrite_cache false \
--use_segment_box \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--gradient_accumulation_steps 1 \
--seed 1000 \
--metric_for_best_model eval_f1 \
--greater_is_better true \
--overwrite_output_dir
python -u run_ner.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/xfund_zh/ \
--dataset_name xfund_zh \
--do_train \
--do_eval \
--lang "ch" \
--max_steps 20000 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern ner-bio \
--preprocessing_num_workers 4 \
--overwrite_cache false \
--use_segment_box \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 1e-5 \
--lr_scheduler_type constant \
--gradient_accumulation_steps 1 \
--seed 1000 \
--metric_for_best_model eval_f1 \
--greater_is_better true \
--overwrite_output_dir
python3 -u run_mrc.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/docvqa_zh/ \
--dataset_name docvqa_zh \
--do_train \
--do_eval \
--lang "ch" \
--num_train_epochs 6 \
--lr_scheduler_type linear \
--warmup_ratio 0.05 \
--weight_decay 0 \
--eval_steps 10000 \
--save_steps 10000 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern "mrc" \
--use_segment_box false \
--return_entity_level_metrics false \
--overwrite_cache false \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 2e-5 \
--preprocessing_num_workers 32 \
--train_nshard 16 \
--seed 1000 \
--metric_for_best_model anls \
--greater_is_better true \
--overwrite_output_dir
python3 -u run_cls.py \
--model_name_or_path ernie-layoutx-base-uncased \
--output_dir ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ \
--dataset_name rvl_cdip_sampled \
--do_train \
--do_eval \
--num_train_epochs 20 \
--lr_scheduler_type linear \
--max_seq_length 512 \
--warmup_ratio 0.05 \
--weight_decay 0 \
--eval_steps 2000 \
--save_steps 2000 \
--save_total_limit 1 \
--load_best_model_at_end \
--pattern "cls" \
--use_segment_box \
--return_entity_level_metrics false \
--overwrite_cache false \
--doc_stride 128 \
--target_size 1000 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--learning_rate 1e-5 \
--preprocessing_num_workers 32 \
--train_nshard 16 \
--seed 1000 \
--metric_for_best_model acc \
--greater_is_better true \
--overwrite_output_dir
After fine-tuning, you can export the inference model with the Model Export Script. The inference model is saved in the output_path you specify.
python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export
python export_model.py --task_type mrc --model_path ./ernie-layoutx-base-uncased/models/docvqa_zh/ --output_path ./mrc_export
python export_model.py --task_type cls --model_path ./ernie-layoutx-base-uncased/models/rvl_cdip_sampled/ --output_path ./cls_export
Parameter Description
model_path: the directory where the dygraph model parameters are saved; defaults to "./checkpoint/".
output_path: the directory where the static graph model parameters are saved; defaults to "./export".
Directory
export/
├── inference.pdiparams
├── inference.pdiparams.info
└── inference.pdmodel
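Before moving on to deployment, it can help to confirm that the export directory contains the three files listed above. This is a minimal sketch using only the standard library; the helper name `missing_export_files` is hypothetical, and the path ./export is simply the default output_path mentioned above.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "inference.pdiparams",
    "inference.pdiparams.info",
    "inference.pdmodel",
)


def missing_export_files(export_dir: str) -> List[str]:
    """Return the expected export files that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_export_files("./export")
print("export looks complete" if not missing else f"missing: {missing}")
```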
We provide deployment examples for Key Information Extraction, Document Question Answering, and Document Image Classification; please follow the ERNIE-Layout Python Deploy Guide.