Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | Pretrain / Fine-tune |
LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | Pretrain / Fine-tune |
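All released checkpoints are hosted on both HuggingFace and ModelScope (the repository IDs are listed below). As a minimal sketch, any of them can be pulled locally with git-lfs, e.g.:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf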
Released model weights:
- LLaVA-Llama-3-8B-v1.1
  - xtuner/llava-llama-3-8b-v1_1-hf: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1-transformers: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-v1_1-gguf: 🤗 HuggingFace / 🤖 ModelScope
- LLaVA-Llama-3-8B
  - xtuner/llava-llama-3-8b-hf: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b-transformers: 🤗 HuggingFace / 🤖 ModelScope
  - xtuner/llava-llama-3-8b: 🤗 HuggingFace / 🤖 ModelScope

LLaVA dataset
File structure
./data/llava_data
├── LLaVA-Pretrain
│ ├── blip_laion_cc_sbu_558k.json
│ ├── blip_laion_cc_sbu_558k_meta.json
│ └── images
├── LLaVA-Instruct-150K
│ └── llava_v1_5_mix665k.json
└── llava_images
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
LLaVA-Pretrain
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
Text data
LLaVA-Instruct-150K
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
Image data
COCO (coco): download url
GQA (gqa): download url
OCR-VQA (ocr_vqa): download script
⚠️ Rename OCR-VQA's images so that they all use the .jpg extension, e.g. with the following script:
#!/bin/bash
# Directory that holds the downloaded OCR-VQA images
ocr_vqa_path="<your-directory-path>"

find "$ocr_vqa_path" -type f | while read -r file; do
    extension="${file##*.}"
    if [ "$extension" != "jpg" ]; then
        # Keep the original file and add a copy with a .jpg extension
        cp -- "$file" "${file%.*}.jpg"
    fi
done
TextVQA (textvqa): download url
Visual Genome (vg): download url (VG_100K and VG_100K_2)
After downloading, unpack the images into ./data/llava_data/llava_images so that the layout matches the tree above; a minimal sketch follows.
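The archive names in this sketch are placeholders for whatever you download from each dataset's site; only the target directories are fixed by the expected file structure.
# Minimal sketch: assemble ./data/llava_data/llava_images.
# Archive names are placeholders; rename the unzipped folders if they differ
# from the layout shown in the tree above.
cd ./data/llava_data
mkdir -p llava_images/{coco,gqa,ocr_vqa,textvqa,vg}
unzip coco_train2017.zip       -d llava_images/coco     # target: coco/train2017
unzip gqa_images.zip           -d llava_images/gqa      # target: gqa/images
unzip textvqa_train_images.zip -d llava_images/textvqa  # target: textvqa/train_images
unzip vg_part1.zip             -d llava_images/vg       # target: vg/VG_100K
unzip vg_part2.zip             -d llava_images/vg       # target: vg/VG_100K_2
# OCR-VQA images are fetched by its download script; place them under llava_images/ocr_vqa/images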
ShareGPT4V dataset
Reference: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
./data/sharegpt4v
├── share-captioner_coco_lcs_sam_1246k_1107.json
├── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
└── data
├── sam
│ └── images
├── share_textvqa
│ └── images
├── web-celebrity
│ └── images
├── web-landmark
│ └── images
├── wikiart
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
└── vg -> ../../llava_data/llava_images/vg
Text data
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
(Note: use resolve/main rather than blob/main so that wget fetches the raw JSON files instead of HTML pages.)
Image data
SAM (sam): download url
ShareTextVQA (share_textvqa): download url
Web-Celebrity (web-celebrity): download url
Web-Landmark (web-landmark): download url
WikiArt (wikiart): download url
llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset; these directories can simply be symlinked to the LLaVA data, as sketched below.
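The symlink targets are exactly those shown in the tree above; the sketch assumes the ./data layout used throughout this document.
# Minimal sketch: create the symlinks shown in the ./data/sharegpt4v tree above
cd ./data/sharegpt4v/data
for d in coco gqa ocr_vqa textvqa vg; do
    ln -s ../../llava_data/llava_images/"$d" "$d"
done
mkdir -p llava/llava_pretrain
ln -s ../../../../llava_data/LLaVA-Pretrain/images llava/llava_pretrain/images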
InternVL-SFT data
Reference: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets
./data/internvl_sft
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
└── data
├── ai2d
│ ├── abc_images
│ └── images
├── chartqa
│ ├── test
│ ├── train
│ └── val
├── docvqa
│ ├── test
│ ├── train
│ └── val
├── dvqa
│ └── images
├── synthdog-en
│ └── images
├── geoqa+
│ └── images
├── llava
│ └── llava_pretrain
│ └── images -> ../../../../llava_data/LLaVA-Pretrain/images
├── coco -> ../../llava_data/llava_images/coco
├── gqa -> ../../llava_data/llava_images/gqa
├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
├── textvqa -> ../../llava_data/llava_images/textvqa
├── vg -> ../../llava_data/llava_images/vg
├── sam -> ../../sharegpt4v/data/sam
├── share_textvqa -> ../../sharegpt4v/data/share_textvqa
├── web-celebrity -> ../../sharegpt4v/data/web-celebrity
├── web-landmark -> ../../sharegpt4v/data/web-landmark
└── wikiart -> ../../sharegpt4v/data/wikiart
Text data
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip
unzip ./playground.zip
Image data
AI2D (ai2d): download url
ChartQA (chartqa): download url
DVQA (dvqa): download url
SynthDoG-EN (synthdog-en): download url
GeoQA+ (geoqa+): download url
llava, coco, gqa, ocr_vqa, textvqa, vg: please refer to the preparation of the LLaVA dataset.
sam, share_textvqa, web-celebrity, web-landmark, wikiart: please refer to the preparation of the ShareGPT4V dataset.
Both groups can be symlinked into ./data/internvl_sft/data as sketched below.
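As with ShareGPT4V, the symlink targets come directly from the tree above and assume the same ./data layout.
# Minimal sketch: create the symlinks shown in the ./data/internvl_sft tree above
cd ./data/internvl_sft/data
for d in coco gqa ocr_vqa textvqa vg; do
    ln -s ../../llava_data/llava_images/"$d" "$d"
done
for d in sam share_textvqa web-celebrity web-landmark wikiart; do
    ln -s ../../sharegpt4v/data/"$d" "$d"
done
mkdir -p llava/llava_pretrain
ln -s ../../../../llava_data/LLaVA-Pretrain/images llava/llava_pretrain/images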
Training

LLaVA-Llama-3-8B
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2 --seed 1024

LLaVA-Llama-3-8B-v1.1
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune/):
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024
XTuner also supports single-GPU training for LLaVA-Llama-3-8B (Youth Edition): the entire multi-modal training process can be completed on a single 20GB GPU.
Pretrain (results are saved to ./work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/):
xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain --deepspeed deepspeed_zero2 --seed 1024
Fine-tune (results are saved to ./work_dirs/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune/):
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
Model conversion
Convert the trained .pth file to a LLaVA model in xtuner format (xtuner/llava-llama-3-8b-v1_1)
After training, we obtain a set of weights (i.e., iter_xxx.pth) that are not in the universal HuggingFace format. We first need to convert them to a LLaVA model in xtuner format:
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
At this point, we have obtained the relevant model (the LLM, or its corresponding LoRA).
If you use the default configuration of LLaVA-Llama-3-8B, the converted output has the following file structure, which includes the fully fine-tuned LLM weights, the projector weights, and the LoRA weights of the visual encoder.
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── projector
│ ├── config.json
│ ├── configuration_projector.py
│ ├── modeling_projector.py
│ └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── visual_encoder_adapter
├── adapter_config.json
├── adapter_model.safetensors
└── README.md
The xtuner-format LLaVA model can now be used for conversation via xtuner chat:
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
and for MMBench evaluation via:
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
Here, $DATA_PATH refers to one of the MMBench datasets. You can download the expected data with:
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
Because LoRA is applied to the ViT (visual encoder) during fine-tuning, the LoRA weights must first be merged into the ViT:
xtuner convert merge openai/clip-vit-large-patch14-336 ./iter_39620_xtuner/visual_encoder_adapter ./iter_39620_visual_encoder --is-clip
The following command converts the model to the official LLaVA format:
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
Here, the converted model in the official LLaVA format is saved to ./iter_39620_llava, with the following file structure:
./iter_39620_llava
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
The following command converts the model to the HuggingFace LLaVA format:
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner --vision_model_id ./iter_39620_visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
Here, the converted model in the HuggingFace LLaVA format is saved to ./iter_39620_hf, with the following file structure:
./iter_39620_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
LMDeploy now supports the deployment of official-LLaVA-format models (e.g., xtuner/llava-llama-3-8b-v1_1-hf); for details, please refer to the LMDeploy documentation.
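As a minimal sketch (verify the required LMDeploy version and supported options against the LMDeploy docs), the released checkpoint could be served behind an OpenAI-compatible API like this:
# Hypothetical serving sketch; confirm model support in the LMDeploy docs
pip install lmdeploy
lmdeploy serve api_server xtuner/llava-llama-3-8b-v1_1-hf --server-port 23333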