Contents
Transformer-XL is an improvement on the Transformer, aimed mainly at modeling long sequences. It combines the sequence-modeling strengths of RNNs with the Transformer's self-attention mechanism by introducing a recurrence mechanism and relative positional encoding: the Transformer attention module is applied to each segment of the input, while the recurrence mechanism learns dependencies between consecutive segments. It achieved SoTA results on language-modeling datasets such as enwik8 and text8.
Paper: Dai Z, Yang Z, Yang Y, et al. Transformer-xl: Attentive language models beyond a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.
The backbone of Transformer-XL is the Transformer, to which it adds a Recurrence Mechanism and Relative Positional Encoding.
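As a rough sketch of the recurrence mechanism (plain NumPy, single attention head, no projection matrices and no relative-position terms — all simplifications of the real model): queries come from the current segment only, while keys and values are built from the cached previous segment concatenated with the current one.

```python
import numpy as np

def segment_attention(h_current, memory):
    """Toy attention over [memory; current segment].

    h_current: (tgt_len, d_model) hidden states of the current segment
    memory:    (mem_len, d_model) cached, gradient-detached states of the
               previous segment
    """
    # Extended context seen by keys/values: (mem_len + tgt_len, d_model)
    context = np.concatenate([memory, h_current], axis=0)
    # Scaled dot-product scores; queries are the current segment only
    scores = h_current @ context.T / np.sqrt(h_current.shape[-1])
    # Numerically stable softmax over the extended context
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context  # (tgt_len, d_model)

rng = np.random.default_rng(0)
tgt_len, mem_len, d_model = 4, 6, 8
out = segment_attention(rng.normal(size=(tgt_len, d_model)),
                        rng.normal(size=(mem_len, d_model)))
print(out.shape)  # (4, 8): one output vector per current-segment position
```

After processing a segment, its hidden states are detached and cached as the `memory` for the next segment, which is how information flows across segment boundaries.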
The following datasets are used for both training and evaluation:
The datasets used are enwik8 and text8. enwik8 contains 100MB of unprocessed Wikipedia text. text8 also contains 100MB of Wikipedia text; the difference is that all characters other than the 26 lowercase letters and spaces have been removed.
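The text8-style filtering described above can be sketched as follows (a simplification of the actual preprocessing pipeline, which also transliterates digits and markup before filtering):

```python
import re

def to_text8_style(raw: str) -> str:
    """Keep only the 26 lowercase letters and spaces, collapsing runs of
    removed characters into single spaces."""
    cleaned = re.sub(r"[^a-z ]+", " ", raw.lower())
    return re.sub(r" +", " ", cleaned).strip()

print(to_text8_style("Hello, World! 42"))  # hello world
```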
- Hardware (Ascend/GPU)
- Prepare hardware environment with Ascend or GPU processor.
- Framework
- For more information, please check the resources below:
After dataset preparation, you can start training and evaluation as follows:
# run training example
bash scripts/run_enwik8_base.sh train [DEVICE_ID]
# run distributed training example
bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
# run evaluation example
bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
.
└─Transformer-XL
  ├─README.md
  ├─scripts
  │ └─run_enwik8_base.sh
  ├─src
  │ ├─callback
  │ │ ├─eval.py
  │ │ ├─flag.py
  │ │ └─log.py
  │ ├─common
  │ │ └─ops.py
  │ ├─loss_fn
  │ │ └─ProjectedAdaptiveLogSoftmaxLoss.py
  │ ├─metric
  │ │ └─calc.py
  │ ├─model
  │ │ ├─attn.py
  │ │ ├─dataset.py
  │ │ ├─embedding.py
  │ │ ├─layer.py
  │ │ ├─mem_transformer.py
  │ │ ├─positionwiseFF.py
  │ │ └─vocabulary.py
  │ ├─model_utils
  │ │ ├─config.py
  │ │ ├─device_adapter.py
  │ │ ├─local_adapter.py
  │ │ └─moxing_adapter.py
  │ └─utils
  │   ├─additional_algorithms.py
  │   ├─dataset_util.py
  │   └─nnUtils.py
  ├─default_config.yaml
  ├─hccl_tools.py
  ├─getdata.sh
  ├─eval.py
  └─train.py
Training Script Parameters
usage:
train.py [--ascend] [--data DATA_PATH]
         [--dataset NAME] [--optim OPTIM]
options:
--ascend     run on an Ascend device
--data       path to the dataset directory: DATA_PATH
--dataset    dataset name, e.g. enwik8
--optim      optimizer, default is adam
Network Parameters
Parameters for dataset and network (Training/Evaluation):
n_layer number of total layers: N, default is 12
d_model dimension of model, default is 512
n_head number of heads, default is 8
d_head head dimension, default is 64
d_inner inner dimension in FF, default is 2048
dropout global dropout rate: Q, default is 0.1
dropatt attention probability dropout rate: Q, default is 0.0
max_step maximum number of training steps: N, default is 400000
tgt_len number of tokens to predict, default is 512
mem_len length of the retained hidden states from previous segments (the memory), default is 512
eval_tgt_len number of tokens to predict for evaluation, default is 128
batch_size batch size of input dataset: N, default is 22
Parameters for learning rate:
lr value of learning rate: Q, default is 0.00025
warmup_step steps of the learning rate warm up: N, default is 0
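The interaction between lr and warmup_step can be illustrated with a minimal warmup rule (illustration only — the repo's actual schedule may apply a further decay, e.g. cosine annealing, after the warmup phase; with the default warmup_step of 0, no warmup is applied):

```python
def lr_schedule(step, lr=0.00025, warmup_step=0):
    """Linearly ramp the learning rate over warmup_step steps,
    then hold it at the base value."""
    if warmup_step > 0 and step < warmup_step:
        return lr * step / warmup_step
    return lr

print(lr_schedule(0, warmup_step=1000))     # 0.0
print(lr_schedule(500, warmup_step=1000))   # 0.000125 (halfway through warmup)
print(lr_schedule(5000, warmup_step=1000))  # 0.00025  (base lr after warmup)
```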
- Download the dataset and configure DATA_PATH.
- Set options in default_config.yaml, including loss_scale, learning rate and network hyperparameters.
- Run run_enwik8_base.sh for non-distributed training of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh train [DEVICE_ID]
- Run run_enwik8_base.sh for distributed training of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
- Set options in default_config.yaml. Make sure 'data' is set to your own path.
- Run eval.py for evaluation of the Transformer-XL model.
  bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
Training Performance

| Parameters               | Ascend                      |
| ------------------------ | --------------------------- |
| Resource                 | Ascend 910; OS Euler2.8     |
| Uploaded Date            | 18/02/2022 (day/month/year) |
| MindSpore Version        | 1.6.0                       |
| Dataset                  | enwik8                      |
| Training Parameters      | batch_size=22               |
| Optimizer                | Adam                        |
| Loss Function            | Softmax Cross Entropy       |
| BPC Score                | 1.11                        |
| Speed                    | 30 ms/batch                 |
| Loss                     | 0.78                        |
| Params (K)               | 15.33                       |
| Checkpoint for inference | 1.45G (.ckpt file)          |
| Scripts                  | Transformer-XL scripts      |
Evaluation Performance

| Parameters        | Ascend                      |
| ----------------- | --------------------------- |
| Resource          | Ascend 910; OS Euler2.8     |
| Uploaded Date     | 18/02/2022 (day/month/year) |
| MindSpore Version | 1.6.0                       |
| Dataset           | enwik8                      |
| batch_size        | 22                          |
| outputs           | loss                        |
| Loss              | 0.78                        |
| BPC Score         | 1.11                        |
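The BPC score is the cross-entropy loss converted from nats per character to bits per character, i.e. the loss divided by ln 2. A quick check on the rounded numbers above (the small mismatch against the reported 1.11 presumably comes from the loss being rounded to two decimals before the conversion):

```python
import math

loss_nats = 0.78                 # cross-entropy loss in nats (from the table)
bpc = loss_nats / math.log(2)    # convert nats/char -> bits/char
print(round(bpc, 2))             # 1.13
```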
There are three sources of randomness:
- Shuffle of the dataset.
- Initialization of some model weights.
- Dropout operations.
Seeds are already set in train.py to remove the randomness of dataset shuffling and weight initialization. If you want to disable dropout as well, set the corresponding dropout parameters (dropout and dropatt) to 0 in default_config.yaml.
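The seeding pattern in train.py follows the usual shape, sketched here with standard-library and NumPy RNGs (the actual script would additionally seed the framework itself, e.g. via mindspore.set_seed):

```python
import random
import numpy as np

SEED = 1
random.seed(SEED)     # Python-level RNG (e.g. shuffle helpers)
np.random.seed(SEED)  # NumPy RNG (e.g. buffers used in weight initialization)

# Reseeding reproduces exactly the same draws, which is what makes
# shuffling and initialization repeatable across runs.
a = np.random.rand(3)
np.random.seed(SEED)
b = np.random.rand(3)
print(np.allclose(a, b))  # True
```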
Please check the official homepage.