- `--model_type`: Represents the selected model type, default is `None`. `model_type` specifies the default `lora_target_modules`, `template_type`, and other information for the corresponding model. You can fine-tune by specifying only `model_type`. The corresponding `model_id_or_path` will use default settings, and the model will be downloaded from ModelScope and use the default cache path. One of `model_type` and `model_id_or_path` must be specified. You can see the list of available `model_type` options here. You can set the `USE_HF` environment variable to control downloading models and datasets from the HF Hub; see the HuggingFace Ecosystem Compatibility Documentation.
- `--model_id_or_path`: Represents the `model_id` in the ModelScope/HuggingFace Hub or a local path to the model, default is `None`. If the provided `model_id_or_path` has already been registered, the `model_type` will be inferred from it. If it has not been registered, both `model_type` and `model_id_or_path` must be specified, e.g. `--model_type <model_type> --model_id_or_path <model_id_or_path>`.
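For instance, a minimal sketch of the two cases, assuming the `swift sft` entrypoint and `qwen-7b-chat` as a registered `model_type` (adjust to your setup):

```bash
# Registered model: model_id_or_path is inferred from model_type
swift sft --model_type qwen-7b-chat --dataset alpaca-zh

# Unregistered checkpoint: specify both arguments explicitly
swift sft --model_type qwen-7b-chat \
    --model_id_or_path /path/to/qwen-7b-chat \
    --dataset alpaca-zh
```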
- `--model_revision`: The revision corresponding to `model_id` on the ModelScope Hub, default is `None`. If `model_revision` is `None`, the revision registered in `MODEL_MAPPING` is used; otherwise the `model_revision` passed on the command line takes precedence.
- `--local_repo_path`: Some models depend on a GitHub repo, which is cloned when loading the model. This parameter points to an existing local repo so the clone can be skipped. These models are:
  - mPLUG-Owl model: https://github.com/X-PLUG/mPLUG-Owl
  - DeepSeek-VL model: https://github.com/deepseek-ai/DeepSeek-VL
  - YI-VL model: https://github.com/01-ai/Yi
  - LLAVA model: https://github.com/haotian-liu/LLaVA.git
- `--sft_type`: Fine-tuning method, default is `'lora'`. Options include: 'lora', 'full', 'longlora', 'qalora'. If using qlora, you need to set `--sft_type lora --quantization_bit 4`.
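For example, a qlora run combines the two flags named above; a sketch with other arguments omitted:

```bash
# qlora = lora fine-tuning on top of a 4-bit quantized base model
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --quantization_bit 4 \
    --dataset alpaca-zh
```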
- `--packing`: Whether to pack dataset samples up to `max_length`, default is `False`.
- `--freeze_parameters`: When `sft_type` is set to 'full', freeze the bottommost parameters of the model. Range is 0. ~ 1., default is `0.`. This provides a compromise between lora and full fine-tuning.
- `--additional_trainable_parameters`: Parameters trainable in addition to those controlled by `freeze_parameters`; only allowed when `sft_type` is 'full', default is `[]`. For example, if you want to train the embedding layer in addition to 50% of the parameters, you can set `--freeze_parameters 0.5 --additional_trainable_parameters transformer.wte`; all parameters whose names start with `transformer.wte` will be activated.
- `--tuner_backend`: Backend support for lora, qlora, default is `'peft'`. Options include: 'swift', 'peft', 'unsloth'.
- `--template_type`: Type of dialogue template used, default is `'AUTO'`, i.e. look up `template` in `MODEL_MAPPING` based on `model_type`. Available `template_type` options can be found in `TEMPLATE_MAPPING.keys()`.
- `--output_dir`: Directory to store ckpt, default is `'output'`. We will append `model_type` and a fine-tuning version number to this directory, allowing users to do multiple comparative experiments on different models without changing the `output_dir` command line argument. If you don't want to append this content, specify `--add_output_dir_suffix false`.
- `--add_output_dir_suffix`: Default is `True`, indicating that a suffix of `model_type` and fine-tuning version number will be appended to the `output_dir` directory. Set to `False` to avoid this behavior.
- `--ddp_backend`: Backend support for distributed training, default is `None`. Options include: 'nccl', 'gloo', 'mpi', 'ccl'.
- `--seed`: Global seed, default is `42`. Used to reproduce training results.
- `--resume_from_checkpoint`: For resuming training from a checkpoint, default is `None`. You can set this to the path of a checkpoint, e.g. `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training.
- `--dtype`: torch_dtype when loading the base model, default is `'AUTO'`, i.e. intelligently select the dtype: if the machine does not support bf16, use fp16; if `MODEL_MAPPING` specifies a torch_dtype for the corresponding model, use that dtype; otherwise use bf16. Options include: 'bf16', 'fp16', 'fp32'.
- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets here. If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports ModelScope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS:]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets is covered in the Customizing and Extending Datasets document. A combined sketch is shown after this list.
  - Supports the MS and HF hubs, as well as dataset_sample. For example, 'MS::alpaca-zh#200', 'HF::jd-sentiment-zh#200' (the default hub used is controlled by the `USE_HF` environment variable, default is MS).
  - More fine-grained control over subsets: It uses the subsets specified during registration by default (if none were specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subsets of the dataset, separated by '/'. For example, 'sharegpt-gpt4:default/V3_format#2000'.
  - Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#20', 'HF::llm-wizard/alpaca-gpt4-data-zh#20', 'hurner/alpaca-gpt4-data-zh#20', 'HF::shibing624/alpaca-zh#20'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, which supports 4 dataset formats, with 'default' subsets and split set to 'train'. The supported dataset formats can be found in the Customizing and Extending Datasets document.
  - Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
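Putting these formats together, a hypothetical sketch mixing a hub dataset, a subset selection, and a local file in one run:

```bash
# 200 samples from the MS hub, 2000 from two subsets, 5000 from a local jsonl
swift sft \
    --model_type qwen-7b-chat \
    --dataset MS::alpaca-zh#200 \
              sharegpt-gpt4:default/V3_format#2000 \
              1.jsonl#5000
```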
- `--val_dataset`: Specify separate validation datasets in the same format as the `dataset` argument. If `val_dataset` is used, `dataset_test_ratio` will be ignored.
- `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as a random_state and does not affect the global seed.
- `--dataset_test_ratio`: Ratio for splitting the subdataset into train and validation sets, default is `0.01`.
- `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset for training. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
- `--val_dataset_sample`: Sampling for the validation dataset, default is `None`, which automatically selects an appropriate number of samples for validation. If you specify `-1`, the complete validation dataset is used. This parameter is deprecated; the number of samples in the validation dataset is fully controlled by `dataset_test_ratio`.
- `--system`: System used in the dialogue template, default is `None`, i.e. use the model's default system. If set to '', no system is used.
- `--max_length`: Maximum token length, default is `2048`. Avoids OOM issues caused by individual overly long samples. When `--truncation_strategy delete` is specified, samples exceeding max_length will be deleted. When `--truncation_strategy truncation_left` is specified, the leftmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, there is no limit.
- `--truncation_strategy`: Default is `'delete'`, which removes sentences exceeding max_length from the dataset. `'truncation_left'` will truncate excess text from the left, which may truncate special tokens and affect performance; not recommended.
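For example, to keep long samples by truncating them from the left instead of dropping them, a sketch (noting the caveat above):

```bash
swift sft \
    --model_type qwen-7b-chat \
    --dataset alpaca-zh \
    --max_length 4096 \
    --truncation_strategy truncation_left
```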
- `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If training an LLM model, `'warning'` is recommended as the data check strategy. If your training target is sentence classification etc., setting it to `'none'` is recommended.
- `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter has been deprecated. There is no longer a distinction between training and validation datasets; the split is now unified using `dataset_test_ratio`. Please use `--dataset {dataset_path}`.
- `--self_cognition_sample`: The number of samples for the self-cognition dataset. Default is `0`. If you set this value to >0, you need to specify `--model_name` and `--model_author` at the same time. This parameter has been deprecated, please use `--dataset self-cognition#{self_cognition_sample}` instead.
- `--model_name`: Default value is `[None, None]`. If self-cognition dataset sampling is enabled (i.e., specifying `--dataset self-cognition` or self_cognition_sample>0), you need to provide two values, representing the Chinese and English names of the model, respectively. For example: `--model_name 小黄 'Xiao Huang'`. If you want to learn more, you can refer to the Self-Cognition Fine-tuning Best Practices.
- `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values, representing the author's Chinese and English names respectively. E.g. `--model_author 魔搭 ModelScope`.
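Combining the three arguments above, a sketch of a self-cognition run using the non-deprecated dataset syntax (500 is an arbitrary sample count for illustration):

```bash
swift sft \
    --model_type qwen-7b-chat \
    --dataset self-cognition#500 \
    --model_name 小黄 'Xiao Huang' \
    --model_author 魔搭 ModelScope
```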
- `--quant_method`: Quantization method, default is `None`. You can choose from 'bnb', 'hqq', 'eetq'.
- `--quantization_bit`: Specifies whether to quantize and the number of quantization bits, default is `0`, i.e. no quantization. To use 4bit qlora, set `--sft_type lora --quantization_bit 4`. HQQ supports 1, 2, 3, 4, and 8 bits; bnb supports 4 and 8 bits.
- `--hqq_axis`: HQQ argument: the axis along which grouping is performed. Supported values are 0 or 1, default is `0`.
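For example, a sketch of a 4-bit HQQ qlora run built from the flags above (values chosen for illustration only):

```bash
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --quant_method hqq \
    --quantization_bit 4 \
    --hqq_axis 0 \
    --dataset alpaca-zh
```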
- `--hqq_dynamic_config_path`: Path to the parameters for dynamic configuration. The key is the name tag of the layer and the value is a quantization config. If set, each layer specified by its id will use its dedicated quantization configuration (see the HQQ reference).
- `--bnb_4bit_comp_dtype`: When doing 4bit quantization, we need to dequantize during the model's forward and backward passes. This specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
- `--bnb_4bit_quant_type`: Quantization method for 4bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
- `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4bit quantization, default is `True`. Has no effect when quantization_bit is 0.
- `--bnb_4bit_quant_storage`: Default value is `None`. This sets the storage type used to pack the quantized 4-bit parameters. Has no effect when quantization_bit is 0.
- `--lora_target_modules`: Specify lora modules, default is `['DEFAULT']`. If lora_target_modules is passed `'DEFAULT'` or `'AUTO'`, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (the default specifies qkv). If passed `'ALL'`, all Linear layers (excluding the head) will be specified as lora modules. If passed `'EMBEDDING'`, the Embedding layer will be specified as a lora module. If memory allows, setting it to 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to specify all Linear and embedding layers as lora modules. This parameter only takes effect when `sft_type` is 'lora'.
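For example, to apply lora to all Linear layers plus the embedding layer as described above, a sketch:

```bash
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --lora_target_modules ALL EMBEDDING \
    --dataset alpaca-zh
```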
- `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
- `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
- `--lora_dropout_p`: Default is `0.05`, only takes effect when `sft_type` is 'lora'.
- `--lora_bias_trainable`: Default is `'none'`, options: 'none', 'all'. Set to `'all'` to make all biases trainable.
- `--lora_modules_to_save`: Default is `[]`. If you want to train embedding, lm_head, or layer_norm, you can set this parameter, e.g. `--lora_modules_to_save EMBEDDING LN lm_head`. If passed `'EMBEDDING'`, the Embedding layer will be added to `lora_modules_to_save`. If passed `'LN'`, `RMSNorm` and `LayerNorm` will be added to `lora_modules_to_save`.
- `--lora_dtype`: Default is `'AUTO'`, specifies the dtype for lora modules. If `AUTO`, follow the dtype of the original module. Options: 'fp16', 'bf16', 'fp32', 'AUTO'.
- `--use_dora`: Default is `False`, whether to use `DoRA`.
- `--use_rslora`: Default is `False`, whether to use `RS-LoRA`.
- `--neftune_noise_alpha`: The noise coefficient added by `NEFTune`, which can improve the performance of instruction fine-tuning; default is `None`. Usually set to 5, 10, or 15. See the related paper.
- `--neftune_backend`: The backend of `NEFTune`, which defaults to the `transformers` library. It may encounter incompatibility when training VL models, in which case it is recommended to specify `swift`.
- `--gradient_checkpointing`: Whether to enable gradient checkpointing, default is `True`. This can be used to save memory, although it slightly reduces training speed. It has a significant effect when max_length and batch_size are large.
- `--deepspeed`: Specifies the path to a deepspeed configuration file, or passes configuration information directly in JSON format; default is `None`, i.e. deepspeed is not enabled. Deepspeed can save memory. We provide default ZeRO-2 and ZeRO-3 configuration files: specify 'default-zero2' to use the default ZeRO-2 config file, or 'default-zero3' to use the default ZeRO-3 config file.
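For example, a sketch of enabling the bundled ZeRO-2 configuration (assuming the `NPROC_PER_NODE` launcher variable used in swift's examples):

```bash
# Multi-GPU training with the default ZeRO-2 config
NPROC_PER_NODE=4 swift sft \
    --model_type qwen-7b-chat \
    --dataset alpaca-zh \
    --deepspeed default-zero2
```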
- `--batch_size`: Batch size during training, default is `1`. Increasing batch_size can improve GPU utilization, but won't necessarily improve training speed, because within a batch, shorter sentences need to be padded to the length of the longest sentence in the batch, introducing invalid computations.
- `--eval_batch_size`: Batch size during evaluation, default is `None`, i.e. set to 1 when `predict_with_generate` is True, and to `batch_size` when False.
- `--num_train_epochs`: Number of epochs to train, default is `1`. If `max_steps >= 0`, this overrides `num_train_epochs`. Usually set to 3 ~ 5.
- `--max_steps`: Max steps for training, default is `-1`. If `max_steps >= 0`, this overrides `num_train_epochs`.
- `--optim`: Default is `'adamw_torch'`.
- `--learning_rate`: Default is `None`, i.e. set to 1e-4 if `sft_type` is lora, and to 1e-5 if `sft_type` is full.
- `--weight_decay`: Default is `0.01`.
- `--gradient_accumulation_steps`: Gradient accumulation, default is `None`, set to `math.ceil(16 / self.batch_size / world_size)`. `total_batch_size = batch_size * gradient_accumulation_steps * world_size`.
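For example, with `batch_size` 1 on 8 GPUs, the default is `math.ceil(16 / 1 / 8) = 2`, giving a total_batch_size of 1 * 2 * 8 = 16.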
- `--max_grad_norm`: Gradient clipping, default is `0.5`.
- `--predict_with_generate`: Whether to use generation for evaluation, default is `False`. If set to False, evaluate using `loss`. If set to True, evaluate using `ROUGE-L` and other metrics. Generative evaluation takes a long time, choose carefully.
- `--lr_scheduler_type`: Default is `'linear'`, options: 'linear', 'cosine', 'constant', etc.
- `--warmup_ratio`: Proportion of warmup in total training steps, default is `0.05`.
- `--eval_steps`: Evaluate every this many steps, default is `50`.
- `--save_steps`: Save every this many steps, default is `None`, i.e. set to `eval_steps`.
- `--save_only_model`: Whether to save only model parameters, without saving the intermediate states needed for checkpoint resuming; default is `None`, i.e. if `sft_type` is 'lora' and deepspeed is not used (`deepspeed` is `None`), set to False, otherwise set to True (e.g. when using full fine-tuning or deepspeed).
- `--save_total_limit`: Number of checkpoints to save, default is `2`, i.e. save the best and the last checkpoint. If set to -1, save all checkpoints.
- `--logging_steps`: Print training information (e.g. loss, learning_rate, etc.) every this many steps, default is `5`.
- `--dataloader_num_workers`: Default is `1`.
- `--push_to_hub`: Whether to synchronously push the trained checkpoint to the ModelScope Hub, default is `False`.
- `--hub_model_id`: Model_id to push to on the ModelScope Hub, default is `None`, i.e. set to `f'{model_type}-{sft_type}'`. You can set this to a model_id or a repo_name. We will infer the user_name based on hub_token. If the remote repository to push to does not exist, a new repository will be created; otherwise the previous repository will be reused. This parameter only takes effect when `push_to_hub` is set to True.
- `--hub_token`: SDK token needed for pushing. Can be obtained from https://modelscope.cn/my/myaccesstoken, default is `None`, i.e. obtained from the environment variable `MODELSCOPE_API_TOKEN`. This parameter only takes effect when `push_to_hub` is set to True.
- `--hub_private_repo`: Whether to set the permission of the pushed model repository on the ModelScope Hub to private, default is `False`. This parameter only takes effect when `push_to_hub` is set to True.
- `--push_hub_strategy`: Push strategy, default is `'push_best'`. Options include: 'end', 'push_best', 'push_last', 'checkpoint', 'all_checkpoints'. 'push_best' means that each time weights are saved, the best model is pushed, overwriting the previous one; 'push_last' means that each time weights are saved, the last weights are pushed, overwriting the previous ones; 'end' means only the best model is pushed at the end of training. This parameter only takes effect when `push_to_hub` is set to True.
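Combining the hub arguments above, a sketch (the token value is a placeholder):

```bash
swift sft \
    --model_type qwen-7b-chat \
    --dataset alpaca-zh \
    --push_to_hub true \
    --hub_model_id my-qwen-7b-chat-lora \
    --hub_token '<your-sdk-token>' \
    --push_hub_strategy push_best
```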
- `--test_oom_error`: Used to detect whether training will cause OOM, default is `False`. If set to True, the training set will be sorted in descending order by max_length, making OOM testing easier. This parameter is generally used for testing; use it carefully.
- `--disable_tqdm`: Whether to disable tqdm, useful when launching a script with `nohup`. Default is `False`, i.e. tqdm is enabled.
- `--lazy_tokenize`: If set to False, preprocess all text before `trainer.train()`. If set to True, delay encoding the text, which reduces preprocessing wait time and memory usage; useful when processing large datasets. Default is `None`, i.e. we intelligently choose based on the template type: usually set to False for LLM models and True for multimodal models (to avoid excessive memory usage from loading images and audio).
- `--preprocess_num_proc`: Use multiprocessing when preprocessing the dataset (tokenizing text). Default is `1`. Like the `lazy_tokenize` command line argument, it is used to solve the slow preprocessing issue. But this strategy cannot reduce memory usage, so if the dataset is huge, `lazy_tokenize` is recommended. Recommended values: 4, 8. Note: when using qwen-audio, this parameter will be forced to 1, because qwen-audio's preprocessing function uses torch's multiprocessing, which would cause compatibility issues.
- `--use_flash_attn`: Whether to use flash attn, default is `None`. Installation steps for flash_attn can be found at https://github.com/Dao-AILab/flash-attention. Models supporting flash_attn can be found in LLM Supported Models.
- `--ignore_args_error`: Whether to ignore the Error thrown by command line parameter errors, default is `False`. Set to True if you need to copy the code to a notebook to run it.
- `--check_model_is_latest`: Check if the model is the latest, default is `True`. Set this to `False` if you need to train offline.
- `--logging_dir`: Default is `None`, i.e. set to `f'{self.output_dir}/runs'`, representing the path where tensorboard files are stored.
- `--report_to`: Default is `['tensorboard']`.
- `--acc_strategy`: Default is `'token'`, options include: 'token', 'sentence'.
- `--save_on_each_node`: Takes effect during multi-machine training, default is `True`.
- `--save_strategy`: Strategy for saving checkpoints, default is `'steps'`, options include: 'steps', 'epoch', 'no'.
- `--evaluation_strategy`: Strategy for evaluation, default is `'steps'`, options include: 'steps', 'epoch', 'no'.
- `--save_safetensors`: Default is `True`.
- `--include_num_input_tokens_seen`: Default is `False`. Tracks the number of input tokens seen throughout training.
- `--max_new_tokens`: Default is `2048`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--do_sample`: Default is `True`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as a default value in deployment parameters.
- `--top_k`: Default is `20`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as a default value in deployment parameters.
- `--top_p`: Default is `0.7`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as a default value in deployment parameters.
- `--repetition_penalty`: Default is `1.`. This parameter will be used as a default value in deployment parameters.
- `--num_beams`: Default is `1`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--gpu_memory_fraction`: Default is `None`. This parameter aims to run training under a specified maximum available GPU memory percentage, used for extreme testing.
- `--train_dataset_mix_ratio`: Default is `0.`. This parameter defines how to mix datasets for training. When this parameter is specified, the training dataset will be mixed with `train_dataset_mix_ratio` times as many samples from the general knowledge dataset specified by `train_dataset_mix_ds`. This parameter has been deprecated, please use `--dataset {dataset_name}#{dataset_sample}` to mix datasets.
- `--train_dataset_mix_ds`: Default is `['ms-bench']`. This is the general knowledge dataset used to prevent knowledge forgetting. This parameter has been deprecated, please use `--dataset {dataset_name}#{dataset_sample}` to mix datasets.
- `--use_loss_scale`: Default is `False`. When it takes effect, it strengthens the loss weight of some Agent fields (the Action/Action Input part) to enhance CoT; it has no effect in regular SFT scenarios.
- `--custom_register_path`: Default is `None`. Pass in a `.py` file used to register templates, models, and datasets.
- `--custom_dataset_info`: Default is `None`. Pass in the path to an external `dataset_info.json`, a JSON string, or a dictionary. Used to register custom datasets. Format example: https://github.com/modelscope/swift/blob/main/swift/llm/data/dataset_info.json
- `--device_map_config_path`: Manually configure the model's device map from a local file, default is `None`.
unsloth has no new parameters; you can use the existing parameters to enable unsloth, as in the sketch below:
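A minimal sketch, assuming the `swift sft` entrypoint and a registered `model_type`:

```bash
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --tuner_backend unsloth \
    --dataset alpaca-zh
```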
export parameters inherit from infer parameters, with the following added parameters:
The eval parameters inherit from the infer parameters, and additionally include the following parameters:
app-ui parameters inherit from infer parameters, with the following added parameters:
deploy parameters inherit from infer parameters, with the following added parameters: