Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | Pretrain / Fine-tune |
LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | Pretrain / Fine-tune |
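All released checkpoints are hosted on both HuggingFace and ModelScope (the repository IDs are listed below). As a minimal sketch, any of them can be pulled locally with git-lfs, e.g.:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf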
Released model weights:
- LLaVA-Llama-3-8B-v1.1
  - xtuner/llava-llama-3-8b-v1_1-hf: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1-transformers: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1-gguf: 🤗 HuggingFace / 🤖 ModelScope
- LLaVA-Llama-3-8B
  - xtuner/llava-llama-3-8b-hf: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-transformers: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b: 🤗 HuggingFace / 🤖 ModelScope

LLaVA dataset
File structure
./data/llava_data
├── LLaVA-Pretrain
│ ├── blip_laion_cc_sbu_558k.json
│ ├── blip_laion_cc_sbu_558k_meta.json
│ └── images
├── LLaVA-Instruct-150K
│ └── llava_v1_5_mix665k.json
└── llava_images
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
LLaVA-Pretrain
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
Text data
LLaVA-Instruct-150K
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
Image data
COCO (coco): download url
GQA (gqa): download url
OCR-VQA (ocr_vqa): download script
⚠️ Rename OCR-VQA's images so that they all use the .jpg extension, e.g. with the following script:
#!/bin/bash
# Directory that holds the downloaded OCR-VQA images
ocr_vqa_path="<your-directory-path>"

find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]; then
        # Keep the original file and add a copy with a .jpg extension
        cp -- "$file" "${file%.*}.jpg"
    fi
done
TextVQA (textvqa): download url
Visual Genome (vg): download url (VG_100K and VG_100K_2)
After downloading, unpack the images into ./data/llava_data/llava_images so that the layout matches the tree above; a minimal sketch follows.
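The archive names in this sketch are placeholders for whatever you download from each dataset's site; only the target directories are fixed by the expected file structure.
# Minimal sketch: assemble ./data/llava_data/llava_images.
# Archive names are placeholders; rename the unzipped folders if they differ
# from the layout shown in the tree above.
cd ./data/llava_data
mkdir -p llava_images/{coco,gqa,ocr_vqa,textvqa,vg}
unzip coco_train2017.zip       -d llava_images/coco     # target: coco/train2017
unzip gqa_images.zip           -d llava_images/gqa      # target: gqa/images
unzip textvqa_train_images.zip -d llava_images/textvqa  # target: textvqa/train_images
unzip vg_part1.zip             -d llava_images/vg       # target: vg/VG_100K
unzip vg_part2.zip             -d llava_images/vg       # target: vg/VG_100K_2
# OCR-VQA images are fetched by its download script; place them under llava_images/ocr_vqa/images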
ShareGPT4V dataset
Reference: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
./data/sharegpt4v
├── share-captioner_coco_lcs_sam_1246k_1107.json
├── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
└── data
├── sam
│ └── images
├── share_textvqa
│ └── images
├── web-celebrity
│ └── images
├── web-landmark
│ └── images
├── wikiart
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
└── vg -> ../../llava_data/llava_images/vg
Text data
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
(Note: use resolve/main rather than blob/main so that wget fetches the raw JSON files instead of HTML pages.)
Image data
SAM (sam): download url
ShareTextVQA (share_textvqa): download url
Web-Celebrity (web-celebrity): download url
Web-Landmark (web-landmark): download url
WikiArt (wikiart): download url
llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset; these directories can simply be symlinked to the LLaVA data, as sketched below.
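The symlink targets are exactly those shown in the tree above; the sketch assumes the ./data layout used throughout this document.
# Minimal sketch: create the symlinks shown in the ./data/sharegpt4v tree above
cd ./data/sharegpt4v/data
for d in coco gqa ocr_vqa textvqa vg; do
    ln -s ../../llava_data/llava_images/"$d" "$d"
done
mkdir -p llava/llava_pretrain
ln -s ../../../../llava_data/LLaVA-Pretrain/images llava/llava_pretrain/images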
InternVL-SFT data
Reference: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets
./data/internvl_sft
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
└── data
├── ai2d
│ ├── abc_images
│ └── images
├── chartqa
│ ├── test
│ ├── train
│ └── val
├── docvqa
│ ├── test
│ ├── train
│ └── val
├── dvqa
│ └── images
├── synthdog-en
│ └── images
├── geoqa+
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
├── vg -> ../../llava_data/llava_images/vg
├── sam -> ../../sharegpt4v/data/sam
├── share_textvqa -> ../../sharegpt4v/data/share_textvqa
├── web-celebrity -> ../../sharegpt4v/data/web-celebrity
├── web-landmark -> ../../sharegpt4v/data/web-landmark
└── wikiart -> ../../sharegpt4v/data/wikiart
Text data
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip
unzip ./playground.zip
Image data
AI2D (ai2d): download url
ChartQA (chartqa): download url
DVQA (dvqa): download url
SynthDoG-EN (synthdog-en): download url
GeoQA+ (geoqa+): download url
llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset.
sam, share_textvqa, web-celebrity, web-landmark, wikiart: please refer to the preparation of the ShareGPT4V dataset.
Both groups can be symlinked into ./data/internvl_sft/data as sketched below.
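As with ShareGPT4V, the symlink targets come directly from the tree above and assume the same ./data layout.
# Minimal sketch: create the symlinks shown in the ./data/internvl_sft tree above
cd ./data/internvl_sft/data
for d in coco gqa ocr_vqa textvqa vg; do
    ln -s ../../llava_data/llava_images/"$d" "$d"
done
for d in sam share_textvqa web-celebrity web-landmark wikiart; do
    ln -s ../../sharegpt4v/data/"$d" "$d"
done
mkdir -p llava/llava_pretrain
ln -s ../../../../llava_data/LLaVA-Pretrain/images llava/llava_pretrain/images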
Training

LLaVA-Llama-3-8B
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2 --seed 1024

LLaVA-Llama-3-8B-v1.1
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024
XTuner also supports single-GPU training for LLaVA-Llama-3-8B (Youth Edition): the entire multi-modal training process can be completed on a single 20GB GPU.
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/):
xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune/):
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
Model conversion
Convert the trained .pth file to a LLaVA model in xtuner format (xtuner/llava-llama-3-8b-v1_1)
After training, we obtain a set of weights (i.e., iter_xxx.pth) that are not in the universal HuggingFace format. We first need to convert them to a LLaVA model in xtuner format:
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
At this point, we have obtained the relevant model (the LLM, or its corresponding LoRA).
If you use the default configuration of LLaVA-Llama-3-8B, the converted output has the following file structure, which includes the fully fine-tuned LLM weights, the projector weights, and the LoRA weights of the visual encoder.
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── projector
│ ├── config.json
│ ├── configuration_projector.py
│ ├── modeling_projector.py
│ └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── visual_encoder_adapter
├── adapter_config.json
├── adapter_model.safetensors
└── README.md
The xtuner-format LLaVA model can now be used for conversation via xtuner chat:
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
and for MMBench evaluation via:
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
Here, $DATA_PATH refers to one of the MMBench datasets. You can download the expected data with:
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
Because LoRA is applied to the ViT (visual encoder) during fine-tuning, the LoRA weights must first be merged into the ViT:
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip
The following command converts the model to the official LLaVA format:
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
Here, the converted model in the official LLaVA format is saved to ./iter_39620_llava, with the following file structure:
./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
The following command converts the model to the HuggingFace LLaVA format:
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
Here, the converted model in the HuggingFace LLaVA format is saved to ./iter_39620_hf, with the following file structure:
./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
LMDeploy now supports the deployment of official-LLaVA-format models (e.g., xtuner/llava-llama-3-8b-v1_1-hf); for details, please refer to the LMDeploy documentation.
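As a minimal sketch (verify the required LMDeploy version and supported options against the LMDeploy docs), the released checkpoint could be served behind an OpenAI-compatible API like this:
# Hypothetical serving sketch; confirm model support in the LMDeploy docs
pip install lmdeploy
lmdeploy serve api_server xtuner/llava-llama-3-8b-v1_1-hf --server-port 23333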