A cutting-edge foundation for your very own LLM.
World-Class Foundational Model, Contributing to Chinese-Style Innovation
🌐 TigerBot • 🤗 Hugging Face • 💻 ModelScope
English | Chinese
[12/29/2023] TigerBot has published its technical report, sharing our approach, technical details, and application-deployment experience 🔥 [paper]
[12/08/2023] Tigerbot family releases updated models - bigger and better 🔥 [Model Download][Evaluation]
[10/19/2023] Long-context (16k) TigerBot released
[9/27/2023] Tigerbot-70b-chat-api adds function-calling capability: [tech report][tigerbot-api]
[9/26/2023] Tigerbot-70b-chat(v3) and Tigerbot-13b-chat(v4) updated: [Model Download]
[9/15/2023] Tigerbot-70b-chat(v2) and Tigerbot-13b-chat(v3) updated: [Model Download]
[9/06/2023] Tigerbot-70b released with open source and free commercial usage: [paper][Model Download]🔥
[8/25/2023] TigerBot updates the 13b-base model: [Model Download][Evaluation]
[8/21/2023] TigerBot releases updated 7b and 13b base/chat models: [Model Download][Evaluation]
[8/19/2023] TigerBot inference (tigerbot.com and tigerbot-api) enables TGI, achieving 3x QPS and 2x response speed.
https://github.com/TigerResearch/TigerBot/assets/32117316/0a8c11b9-6a10-4e37-80e8-45b482e76c51
[8/08/2023] TigerBot 2023.08 (V3) release: TigerBot is pleased to announce the release of the TigerBot-13B large model. Built on Llama-2 with TigerBot's accumulated technology and data, this model retains Llama-2's excellent English abilities while filling the gap in its Chinese abilities, surpassing Llama-2 by 49% on mainstream Chinese tasks. It is competitive with comparable open-source models. 🔥 [paper]
python infer.py --model_path TigerResearch/tigerbot-13b-chat
[Evaluation][huggingface]
[8/03/2023] TigerBot is compatible with the OpenAI interface. [tigerbot-api]
[7/26/2023] TigerBot opens its search API [tigerbot-api]
[7/08/2023] TigerBot 2023.07 (V2) release [paper] 🔥
tigerbot-7b-base (v2): fully pretrained on 1.5TB of high-quality data (4 weeks of training time and ~3 million dollars in compute cost), it outperforms comparably sized Bloom and Llama models on both Chinese and English public datasets by 15-30%. [Evaluation][Model Download]
tigerbot-7b-sft (v2): based on the base-v2 model and fine-tuned on 20 million (20G) high-quality cleaned and aligned examples, it outperforms the previous SFT model (sft-v1) on 9 public corpus evaluations by 9.3%. [Evaluation][Model Download]
The new model can be loaded using the following code:
import transformers
# If you have downloaded the old version, you need to specify `force_download=True` to avoid using the old cache.
model_sft = transformers.AutoModelForCausalLM.from_pretrained('TigerResearch/tigerbot-7b-sft', force_download=True)
model_base = transformers.AutoModelForCausalLM.from_pretrained('TigerResearch/tigerbot-7b-base', force_download=True)
We are hosting an internet plugin that enables web browsing with TigerBot. TigerBot uses mainstream search engines and web tools (such as weather, stock, and calculator) to navigate results and interact with websites. You can also use the TigerBot chat-api with the internet-search switch. [TigerBot with search mode (default off) :earth_asia:][paper]
You can use the TigerBot chat-api with the streaming switch. [TigerBot][TigerBot-API]
New features in tigerbot-api, including LLM (chat, plugin, finetune), text (embedding, summarization, pdf2text), and vision (text2image). [TigerBot-API]
[6/27/2023] PEFT TigerBot with QLoRA: fine-tune a tigerbot-7b-sft model on a single RTX 3090 with QLoRA, which speeds up training by 16x and cuts GPU memory usage, while also helping prevent overfitting on downstream data. [code][paper][Model Download]
conda create --name tigerbot python=3.8
conda activate tigerbot
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/TigerResearch/TigerBot
cd TigerBot
pip install -r requirements.txt
Model | Version | Architecture | Disk size (GB) | Note |
---|---|---|---|---|
tigerbot-70b-base | v2 [🤗][🤖] | llama-2 | 129 | From llama-2-70b weights |
 | v1 [🤗][🤖] | llama-2 | 129 | From llama-2-70b weights |
tigerbot-70b-chat | v4-4k [🤗][🤖] | llama-2 | 129 | From tigerbot-70b-base v2 |
 | v4 [🤗][🤖] | llama-2 | 129 | From tigerbot-70b-base v2 |
 | v3 [🤗][🤖] | llama-2 | 129 | From tigerbot-70b-base v1 |
 | v2 [🤗][🤖] | llama-2 | 129 | From tigerbot-70b-base v1 |
 | v1 [🤗] | llama-2 | 129 | From tigerbot-70b-base v1 |
tigerbot-70b-chat-4bit | v4 [🤗] | llama-2 | 37 | From tigerbot-70b-chat v4 |
 | v3 [🤗] | llama-2 | 37 | From tigerbot-70b-chat v3 |
 | v2 [🤗] | llama-2 | 37 | From tigerbot-70b-chat v2 |
 | v1 [🤗] | llama-2 | 37 | From tigerbot-70b-chat v1 |
tigerbot-13b-base | v3 [🤗][🤖] | llama-2 | 26.6 | From llama-2-13b weights |
 | v2 [🤗][🤖] | llama-2 | 26.6 | From llama-2-13b weights |
 | v1 [🤗] | llama-2 | 26.6 | From llama-2-13b weights |
tigerbot-13b-chat | v5-4k [🤗][🤖] | llama-2 | 26.6 | From tigerbot-13b-base v3 |
 | v5 [🤗][🤖] | llama-2 | 26.6 | From tigerbot-13b-base v3 |
 | v4 [🤗][🤖] | llama-2 | 26.6 | From tigerbot-13b-base v2 |
 | v3 [🤗][🤖] | llama-2 | 26.6 | From tigerbot-13b-base v2 |
 | v2 [🤗] | llama-2 | 26.6 | From tigerbot-13b-base v2 |
 | v1 [🤗] | llama-2 | 26.6 | From tigerbot-13b-base v1 |
tigerbot-13b-chat-4bit | v5 [🤗] | llama-2 | 11.5 | From tigerbot-13b-chat v5-4k |
 | v4 [🤗] | llama-2 | 11.5 | From tigerbot-13b-chat v4 |
tigerbot-7b-base | v3 [🤗][🤖] | llama-2 | 13.9 | From llama-2-7b weights |
 | v2 [🤗] | bloom | 16.2 | From bloom weights |
 | v1 [🤗] | bloom | 16.2 | From bloom weights |
tigerbot-7b-chat | v3 [🤗][🤖] | llama-2 | 13.9 | From tigerbot-7b-base v3 |
 | v2 [🤗] | bloom | 16.2 | From tigerbot-7b-base v2 |
 | v1 [🤗] | bloom | 16.2 | From tigerbot-7b-base v1 |
tigerbot-7b-chat-8bit | v3 [🤗] | llama-2 | 10.8 | From tigerbot-7b-chat v3 |
tigerbot-7b-chat-4bit | v3 [🤗] | llama-2 | 6.5 | From tigerbot-7b-chat v3 |
tigerbot-180b-base | v2 [🤗][🤖] | bloom | 347.6 | From bloom weights |
tigerbot-180b-chat | v2 [🤗][🤖] | bloom | 347.6 | From tigerbot-180b-base v2 |
 | v1 [🤗] | bloom | 347.6 | From bloom weights |
CUDA_VISIBLE_DEVICES=0 python infer.py --model_path tigerbot-13b-chat --max_input_length 1024 --max_generate_length 1024 --streaming True
Parameters:
- `--model_path`: model path
- `--model_type=chat`: base/chat
- `--max_input_length=1024`: maximum input length
- `--max_generate_length=1024`: maximum output length
- `--rope_scaling=None`: length extrapolation method ("dynamic" and "yarn" are currently supported)
- `--rope_factor=8.0`: extrapolation parameter
- `--streaming`: streaming output

You can run inference from the command line. Enter `clear` to clear the history and `exit` to quit.
export PYTHONPATH='./' ; export CUDA_VISIBLE_DEVICES=0 ; streamlit run apps/web_demo.py -- --model_path tigerbot-13b-chat
Parameters are the same as for the CLI.
Both the CLI and the web page are demo versions. TGI implements engineering features such as mixed batching and request queuing; if you have a large volume of inference traffic, we recommend serving through the TGI image.
docker run --gpus '"device=0,1,2,3"' -d -p 8080:80 \
-v PATH-TO-MODEL-DIR:/model ghcr.io/huggingface/text-generation-inference:1.1.1 \
--model-id /model --max-total-tokens=1024 --max-input-length=1024 \
--max-batch-prefill-tokens=1024
Please choose suitable parameters based on the model size and your hardware. Generally speaking, 7B/13B models require 1x A100 40G, and 70B models require 4x A100.
Note that with TGI deployment, the generation control parameters must be specified in each request.
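Since generation controls live in each request body rather than in the server startup flags, a client call against the container above might look like the following sketch (the local URL matches the `docker run` port mapping; field names follow TGI's `/generate` API):

```python
import json
import urllib.request


def build_tgi_request(prompt, max_new_tokens=256, temperature=0.7):
    """Build a request payload for TGI's /generate endpoint.

    Generation controls (max_new_tokens, temperature, ...) belong to
    each request body, not to the server startup flags.
    """
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "do_sample": temperature > 0,
        },
    }


def generate(prompt, url="http://localhost:8080/generate", **params):
    # Assumes a TGI container listening on localhost:8080 as started above.
    payload = json.dumps(build_tgi_request(prompt, **params)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```

Each request can therefore use different sampling settings without restarting the server.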
Use ExLlamaV2 to load [TigerResearch/tigerbot-70b-chat-v4-4bit-exl2] for faster inference.
# Install ExLLaMaV2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
# Start inference
CUDA_VISIBLE_DEVICES=0 python other_infer/exllamav2_hf_infer.py --model_path ${MODEL_PATH}
`MODEL_PATH` is the path to the quantized model, such as TigerResearch/tigerbot-70b-chat-v4-4bit-exl2.
To use the quantization method above, upgrade transformers and bitsandbytes to the latest versions (transformers==4.33.1 and bitsandbytes==0.41.1 are known to work):
pip install -U transformers bitsandbytes
This method quantizes the model online at load time, then runs inference:
CUDA_VISIBLE_DEVICES=0 python other_infer/quant_infer.py --model_path ${MODEL_DIR} --wbit 8
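To illustrate the idea behind `--wbit 8` online quantization (this is a conceptual sketch, not the script's actual implementation), symmetric per-tensor 8-bit quantization maps each weight to an integer in [-127, 127] with one shared scale:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [scale * v for v in q]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error is bounded by half a quantization step (scale / 2).
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

Storing `q` instead of `weights` is what shrinks the model roughly 4x versus fp32 (2x versus fp16), at the cost of this bounded rounding error.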
Install DeepSpeed
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
Edit `TORCH_CUDA_ARCH_LIST` to match the compute capability of the GPU cards you intend to use. You can find it with:
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
So if the command prints (8, 0), use TORCH_CUDA_ARCH_LIST="8.0".
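The mapping from the capability tuple to the arch-list string is simply `major.minor`; a small helper (hypothetical, not part of the repo) that also handles machines with mixed GPU models:

```python
def arch_from_capability(cap):
    """Map a torch.cuda.get_device_capability() tuple, e.g. (8, 0),
    to a TORCH_CUDA_ARCH_LIST entry such as "8.0"."""
    major, minor = cap
    return f"{major}.{minor}"


def arch_list(caps):
    """Join several capabilities into one TORCH_CUDA_ARCH_LIST value,
    deduplicated, e.g. for a mixed-GPU machine."""
    return ";".join(sorted({arch_from_capability(c) for c in caps}))
```

For example, a box with A100 (8, 0) and RTX 3090 (8, 6) cards would build with `TORCH_CUDA_ARCH_LIST="8.0;8.6"`.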
Training `tigerbot-7b` requires at least 1 x A100 (40GB); training `tigerbot-180b` requires at least 16 x A100 (40GB).
deepspeed \
--include="localhost:0,1,2,3" \
./train_clm.py \
--deepspeed ./ds_config/ds_config_zero3.json \
--model_name_or_path TigerResearch/tigerbot-7b-base \
--dataset_name TigerResearch/dev_pretrain \
--do_train \
--output_dir ./ckpt-clm \
--overwrite_output_dir \
--preprocess_num_workers 8 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--evaluation_strategy steps \
--eval_steps 10 \
--bf16 True \
--save_strategy steps \
--save_steps 10 \
--save_total_limit 2 \
--logging_steps 10 \
--tf32 True \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2
deepspeed \
--include="localhost:0,1,2,3" \
./train_sft.py \
--deepspeed ./ds_config/ds_config_zero3.json \
--model_name_or_path TigerResearch/tigerbot-7b-base \
--dataset_name TigerResearch/dev_sft \
--do_train \
--output_dir ./ckpt-sft \
--overwrite_output_dir \
--preprocess_num_workers 8 \
--num_train_epochs 5 \
--learning_rate 1e-5 \
--evaluation_strategy steps \
--eval_steps 10 \
--bf16 True \
--save_strategy steps \
--save_steps 10 \
--save_total_limit 2 \
--logging_steps 10 \
--tf32 True \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2
We use classic benchmarks for automatic evaluation on 13 tasks, covering code, commonsense reasoning, reading comprehension, math, and natural language understanding. Our automatic evaluation system is built on opencompass (thanks to @opencompass):
# Installation
cd opencompass
pip install -e .
# Download dataset to the data/ directory
wget https://github.com/InternLM/opencompass/releases/download/0.1.1/OpenCompassData.zip
unzip OpenCompassData.zip
# Run the evaluation task:
CUDA_VISIBLE_DEVICES=0,1,2 python run.py configs/eval_tigerbot_13b.py -w outputs/tigerbot-13b-base
The overall score is the average of the scores across tasks.
Evaluation results for the chat model:
Evaluation results for the base model:
Following the distribution of GPT-3's pretraining data, we collected data from Chinese books, the internet, and encyclopedias, then filtered it through source-quality control and tf-idf soft deduplication. From 20TB of raw data we filtered down to 2TB, preserving the proportions of languages and categories. From this we randomly sampled 100G of data and released it as open source.
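The tf-idf soft-deduplication step can be sketched as: vectorize each document with tf-idf, then drop any document whose cosine similarity to an already-kept document exceeds a threshold. A minimal pure-Python illustration follows (the production pipeline at 20TB scale presumably uses approximate, distributed methods; this only shows the principle):

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute tf-idf vectors (as sparse dicts) for a list of token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_dedup(docs, threshold=0.8):
    """Keep a doc only if it is not too similar to any doc kept so far."""
    vecs = tfidf_vectors(docs)
    kept = []
    for i, v in enumerate(vecs):
        if all(cosine(v, vecs[j]) < threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```

"Soft" here means near-duplicates below the threshold survive, so paraphrases of common topics are not over-pruned.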
English Pretraining Corpus - 51G [hugging face]
Type | Disk | Source |
---|---|---|
zh-book | 12G | TigerBot |
zh-webtext | 25G | TigerBot |
zh-baike | 19G | TigerBot |
en-book | 22G | Public |
en-web | 6.9G | Public |
en-wiki | 22G | Public |
Total | 106G | |
Distribution of Pre-training Data
The data collection strategy used for fine-tuning the model involves the following:
a. Summarize 10 categories and 120 sub-tasks based on the natural distribution of user instructions, including tasks such as factual questioning, open-ended creation, syntax analysis, and code editing.
b. Self-instruct: Refer to the Alpaca self-instruct method to expand the seed tasks in both Chinese and English, adding some culturally-specific questions. Based on this, generate 2 million Chinese (0.5 million open-sourced) and 0.1 million English (50k open-sourced) tasks.
c. Human labeling: organize and process question-answer datasets based on human writing and answer collection, as well as web searches. This forms the [self-developed] subset in the open-source list, part of which is released.
d. Open-source data cleaning: clean data from various public datasets, including [self-developed *] datasets built through secondary development of raw data, and [open-source] datasets that already contain relatively well-organized question-answer pairs and need only simple cleaning.
e. The overall distribution of data aligns with the natural distribution of user instructions.
a. Filtering - sensitive words rule: Based on an accumulated sensitive word library, remove sensitive words related to politics, pornography, violence, terrorism, etc. from the dataset;
b. Filtering - invalid input/output rule: This rule mainly focuses on removing specific issues related to the Alpaca Self-Instruct method. Separate rules are established for inputs and outputs to filter out invalid items; for example, invalid inputs include "" and invalid outputs include "[image]".
c. Cleaning - keyword rules: Replace data based on a compiled list of keywords or regular expressions, including removing special characters, non-visible characters, tags, converting between traditional and simplified Chinese characters, etc.;
d. Cleaning - special logic rules: These rules are used to clean specific issues in the dataset such as duplicate instructions and input/output pairs as follows:
{"instruction": "Describe how to make a red-cooked pork dish. Please provide the ingredients and detailed steps.", "input": "Please describe how to make a red-cooked pork dish and provide the ingredients and detailed steps.", ...}
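Rule (d) above flags records whose instruction and input say essentially the same thing, as in the red-cooked pork example. A minimal sketch of such a check, using word-level Jaccard similarity as a simple stand-in for whatever logic the actual pipeline uses:

```python
def near_duplicate(a, b, threshold=0.6):
    """Return True when two strings are near-duplicates, measured by
    Jaccard similarity over lowercased word sets (a rough proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def clean_records(records, threshold=0.6):
    """Blank out the `input` field when it merely repeats the instruction."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's record
        if near_duplicate(rec.get("instruction", ""), rec.get("input", ""), threshold):
            rec["input"] = ""
        cleaned.append(rec)
    return cleaned
```

On the red-cooked pork record above, the instruction and input share almost every word, so the redundant input is dropped, while a record like `{"instruction": "Translate to French.", "input": "Good morning."}` is left untouched.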
Type | Language | Dataset | Number | Source |
---|---|---|---|---|
alpaca-zh | zh | tigerbot-alpaca-zh-0.5m | 500K | TigerBot |
wiki-qa | zh | tigerbot-wiki-qa-1k | 1K | TigerBot |
book-qa | zh | tigerbot-book-qa-1k | 1K | TigerBot |
riddle-qa | zh | tigerbot-riddle-qa-1k | 1K | TigerBot |
mrc | zh | tigerbot-superclue-c3-zh-5k | 5K | TigerBot |
HC3-qa | zh | tigerbot-HC3-zh-12k | 12K | Public |
zhihu-qa | zh | tigerbot-zhihu-zh-10k | 10K | Public |
alpaca-en | en | tigerbot-alpaca-en-50k | 50K | TigerBot |
brainstorm | en | tigerbot-dolly-Brainstorming-en-1.7k | 1.7K | Public |
classify | en | tigerbot-dolly-Classification-en-2k | 2K | Public |
code | en | tigerbot-kaggle-leetcodesolutions-en-2k | 2K | TigerBot |
recipe | en | tigerbot-kaggle-recipes-en-2k | 2K | Public |
medical-note | en | tigerbot-mt-note-generation-en | 0.45K | Public |
multi-turn | en | tigerbot-OIG-multichat-en-50k | 50K | TigerBot |
general | en | tigerbot-stackexchange-qa-en-0.5m | 500K | Public |
wiki-qa | en | tigerbot-wiki-qa-bart-en-10k | 10K | Public |
youtube-howto | en | tigerbot-youtube-howto-en-50k | 50K | Public |
Total | | | 1200K | |
More datasets are being organized and released continuously...
We open up data in the finance, law, and encyclopedia fields as external data sources for rethink:
Type | Number |
---|---|
Finance-Research | 5K |
Finance-Earning | 1K |
Law | 550K |
Wiki | 100K |
021-63888086