Contents
Transformer-XL is an improvement on the Transformer, aimed mainly at modeling long sequences. It combines the sequence-modeling strengths of RNNs with the Transformer's self-attention mechanism by introducing a recurrence mechanism and relative positional encoding: the Transformer attention module is applied to each segment of the input, while the recurrence mechanism learns dependencies between consecutive segments. It achieved SoTA results on language-modeling datasets such as enwik8 and text8.
Paper: Dai Z, Yang Z, Yang Y, et al. Transformer-xl: Attentive language models beyond a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.
The backbone of Transformer-XL is the Transformer, to which it adds a Recurrence Mechanism and Relative Positional Encoding.
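As a rough sketch of the recurrence mechanism (plain NumPy, single attention head, no projection matrices and no relative-position terms — all simplifications of the real model): queries come from the current segment only, while keys and values are built from the cached previous segment concatenated with the current one.

```python
import numpy as np

def segment_attention(h_current, memory):
    """Toy attention over [memory; current segment].

    h_current: (tgt_len, d_model) hidden states of the current segment
    memory:    (mem_len, d_model) cached, gradient-detached states of the
               previous segment
    """
    # Extended context seen by keys/values: (mem_len + tgt_len, d_model)
    context = np.concatenate([memory, h_current], axis=0)
    # Scaled dot-product scores; queries are the current segment only
    scores = h_current @ context.T / np.sqrt(h_current.shape[-1])
    # Numerically stable softmax over the extended context
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context  # (tgt_len, d_model)

rng = np.random.default_rng(0)
tgt_len, mem_len, d_model = 4, 6, 8
out = segment_attention(rng.normal(size=(tgt_len, d_model)),
                        rng.normal(size=(mem_len, d_model)))
print(out.shape)  # (4, 8): one output vector per current-segment position
```

After processing a segment, its hidden states are detached and cached as the `memory` for the next segment, which is how information flows across segment boundaries.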
The following datasets are used for both training and evaluation:
The datasets used are enwik8 and text8. enwik8 contains 100MB of unprocessed Wikipedia text. text8 also contains 100MB of Wikipedia text; the difference is that all characters other than the 26 lowercase letters and spaces have been removed.
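The text8-style filtering described above can be sketched as follows (a simplification of the actual preprocessing pipeline, which also transliterates digits and markup before filtering):

```python
import re

def to_text8_style(raw: str) -> str:
    """Keep only the 26 lowercase letters and spaces, collapsing runs of
    removed characters into single spaces."""
    cleaned = re.sub(r"[^a-z ]+", " ", raw.lower())
    return re.sub(r" +", " ", cleaned).strip()

print(to_text8_style("Hello, World! 42"))  # hello world
```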
- Hardware (Ascend/GPU)
- Prepare hardware environment with Ascend or GPU processor.
- Framework
- For more information, please check the resources below:
After dataset preparation, you can start training and evaluation as follows:
# run training example
bash scripts/run_enwik8_base.sh train [DEVICE_ID]
# run distributed training example
bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
# run evaluation example
bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
.
└─Transformer-XL
  ├─README.md
  ├─scripts
  │ └─run_enwik8_base.sh
  ├─src
  │ ├─callback
  │ │ ├─eval.py
  │ │ ├─flag.py
  │ │ └─log.py
  │ ├─common
  │ │ └─ops.py
  │ ├─loss_fn
  │ │ └─ProjectedAdaptiveLogSoftmaxLoss.py
  │ ├─metric
  │ │ └─calc.py
  │ ├─model
  │ │ ├─attn.py
  │ │ ├─dataset.py
  │ │ ├─embedding.py
  │ │ ├─layer.py
  │ │ ├─mem_transformer.py
  │ │ ├─positionwiseFF.py
  │ │ └─vocabulary.py
  │ ├─model_utils
  │ │ ├─config.py
  │ │ ├─device_adapter.py
  │ │ ├─local_adapter.py
  │ │ └─moxing_adapter.py
  │ └─utils
  │   ├─additional_algorithms.py
  │   ├─dataset_util.py
  │   └─nnUtils.py
  ├─default_config.yaml
  ├─hccl_tools.py
  ├─getdata.sh
  ├─eval.py
  └─train.py
Training Script Parameters
usage:
train.py [--ascend] [--data DATA_PATH]
         [--dataset NAME] [--optim OPTIM]
options:
--ascend     run on an Ascend device
--data       path to the dataset directory: DATA_PATH
--dataset    dataset name, e.g. enwik8
--optim      optimizer, default is adam
Network Parameters
Parameters for dataset and network (Training/Evaluation):
n_layer number of total layers: N, default is 12
d_model dimension of model, default is 512
n_head number of heads, default is 8
d_head head dimension, default is 64
d_inner inner dimension in FF, default is 2048
dropout global dropout rate: Q, default is 0.1
dropatt attention probability dropout rate: Q, default is 0.0
max_step maximum number of training steps: N, default is 400000
tgt_len number of tokens to predict, default is 512
mem_len length of the retained hidden states from previous segments (the memory), default is 512
eval_tgt_len number of tokens to predict for evaluation, default is 128
batch_size batch size of input dataset: N, default is 22
Parameters for learning rate:
lr value of learning rate: Q, default is 0.00025
warmup_step steps of the learning rate warm up: N, default is 0
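The interaction between lr and warmup_step can be illustrated with a minimal warmup rule (illustration only — the repo's actual schedule may apply a further decay, e.g. cosine annealing, after the warmup phase; with the default warmup_step of 0, no warmup is applied):

```python
def lr_schedule(step, lr=0.00025, warmup_step=0):
    """Linearly ramp the learning rate over warmup_step steps,
    then hold it at the base value."""
    if warmup_step > 0 and step < warmup_step:
        return lr * step / warmup_step
    return lr

print(lr_schedule(0, warmup_step=1000))     # 0.0
print(lr_schedule(500, warmup_step=1000))   # 0.000125 (halfway through warmup)
print(lr_schedule(5000, warmup_step=1000))  # 0.00025  (base lr after warmup)
```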
- Download the dataset and configure DATA_PATH.
- Set options in default_config.yaml, including loss_scale, learning rate and network hyperparameters.
- Run run_enwik8_base.sh for non-distributed training of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh train [DEVICE_ID]
- Run run_enwik8_base.sh for distributed training of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
- Set options in default_config.yaml. Make sure 'data' is set to your own path.
- Run eval.py for evaluation of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
Training Performance

| Parameters               | Ascend                      |
| ------------------------ | --------------------------- |
| Resource                 | Ascend 910; OS Euler2.8     |
| Uploaded Date            | 18/02/2022 (day/month/year) |
| MindSpore Version        | 1.6.0                       |
| Dataset                  | enwik8                      |
| Training Parameters      | batch_size=22               |
| Optimizer                | Adam                        |
| Loss Function            | Softmax Cross Entropy       |
| BPC Score                | 1.11                        |
| Speed                    | 30 ms/batch                 |
| Loss                     | 0.78                        |
| Params (K)               | 15.33                       |
| Checkpoint for inference | 1.45G (.ckpt file)          |
| Scripts                  | Transformer-XL scripts      |
Evaluation Performance

| Parameters        | Ascend                      |
| ----------------- | --------------------------- |
| Resource          | Ascend 910; OS Euler2.8     |
| Uploaded Date     | 18/02/2022 (day/month/year) |
| MindSpore Version | 1.6.0                       |
| Dataset           | enwik8                      |
| batch_size        | 22                          |
| outputs           | loss                        |
| Loss              | 0.78                        |
| BPC Score         | 1.11                        |
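The BPC score is the cross-entropy loss converted from nats per character to bits per character, i.e. the loss divided by ln 2. A quick check on the rounded numbers above (the small mismatch against the reported 1.11 presumably comes from the loss being rounded to two decimals before the conversion):

```python
import math

loss_nats = 0.78                 # cross-entropy loss in nats (from the table)
bpc = loss_nats / math.log(2)    # convert nats/char -> bits/char
print(round(bpc, 2))             # 1.13
```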
There are three sources of randomness:
- Shuffle of the dataset.
- Initialization of some model weights.
- Dropout operations.
Seeds are already set in train.py to remove the randomness of dataset shuffling and weight initialization. If you want to disable dropout as well, set the corresponding dropout parameters (dropout and dropatt) to 0 in default_config.yaml.
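The seeding pattern in train.py follows the usual shape, sketched here with standard-library and NumPy RNGs (the actual script would additionally seed the framework itself, e.g. via mindspore.set_seed):

```python
import random
import numpy as np

SEED = 1
random.seed(SEED)     # Python-level RNG (e.g. shuffle helpers)
np.random.seed(SEED)  # NumPy RNG (e.g. buffers used in weight initialization)

# Reseeding reproduces exactly the same draws, which is what makes
# shuffling and initialization repeatable across runs.
a = np.random.rand(3)
np.random.seed(SEED)
b = np.random.rand(3)
print(np.allclose(a, b))  # True
```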
Please check the official homepage.