Transformer-XL is an improvement over the Transformer that mainly addresses the problem of modeling long sequences. It combines the advantages of RNN-style sequence modeling with the Transformer's self-attention mechanism: it introduces a recurrence mechanism and relative positional encoding, applies the Transformer attention module to each segment of the input data, and uses the recurrence mechanism to learn dependencies between consecutive segments. It achieved state-of-the-art results on language modeling datasets such as enwik8 and text8.
Paper: Dai Z, Yang Z, Yang Y, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860, 2019.
The backbone of Transformer-XL is the Transformer, with a recurrence mechanism and relative positional encoding added on top of the original architecture.
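The recurrence mechanism can be summarized as follows: when processing the current segment, the attention keys and values also cover cached hidden states from previous segments, while no gradient flows into that cache. Below is a minimal NumPy sketch of this idea; it is purely illustrative and is not the implementation in src/model/mem_transformer.py (which also applies relative positional encoding).

```python
# Illustrative single-head attention with segment-level memory (assumed shapes,
# not the repository's code). Memory is treated as a constant, mirroring the
# stop-gradient on cached states in Transformer-XL.
import numpy as np

def attend_with_memory(h_current, memory, w_q, w_k, w_v):
    """h_current: (cur_len, d_model), memory: (mem_len, d_model),
    w_q / w_k / w_v: (d_model, d_head) projection matrices."""
    # Keys and values see both the cached memory and the current segment;
    # queries come only from the current segment.
    context = np.concatenate([memory, h_current], axis=0)
    q = h_current @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def update_memory(memory, h_current, mem_len):
    # Append the newest hidden states and keep only the last mem_len rows.
    return np.concatenate([memory, h_current], axis=0)[-mem_len:]
```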
The following datasets contain both the training data and the evaluation data.
The datasets used are enwik8 and text8. enwik8 contains 100MB of unprocessed Wikipedia text; text8 also contains 100MB of Wikipedia text, the difference being that all characters other than the 26 lowercase letters and spaces have been removed.
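As a purely illustrative sketch of that cleaning rule (the text8 file downloaded by getdata.sh is already preprocessed, so nothing like this needs to be run; the function name is hypothetical):

```python
# Keep only the 26 lowercase letters and single spaces, as described above.
import re

def clean_text8_style(raw: str) -> str:
    lowered = raw.lower()
    letters_only = re.sub(r"[^a-z ]", " ", lowered)  # drop everything else
    return re.sub(r" +", " ", letters_only).strip()  # collapse repeated spaces

print(clean_text8_style("Hello, World! 123"))  # -> "hello world"
```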
After dataset preparation, you can start training and evaluation as follows:
# run training example
bash scripts/run_enwik8_base.sh train [DEVICE_ID]
# run distributed training example
bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
# run evaluation example
bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
.
└─Transformer-XL
  ├─README.md
  ├─scripts
  │ └─run_enwik8_base.sh
  ├─src
  │ ├─callback
  │ │ ├─eval.py
  │ │ ├─flag.py
  │ │ └─log.py
  │ ├─common
  │ │ └─ops.py
  │ ├─loss_fn
  │ │ └─ProjectedAdaptiveLogSoftmaxLoss.py
  │ ├─metric
  │ │ └─calc.py
  │ ├─model
  │ │ ├─attn.py
  │ │ ├─dataset.py
  │ │ ├─embedding.py
  │ │ ├─layer.py
  │ │ ├─mem_transformer.py
  │ │ ├─positionwiseFF.py
  │ │ └─vocabulary.py
  │ ├─model_utils
  │ │ ├─config.py
  │ │ ├─device_adapter.py
  │ │ ├─local_adapter.py
  │ │ └─moxing_adapter.py
  │ └─utils
  │   ├─additional_algorithms.py
  │   ├─dataset_util.py
  │   └─nnUtils.py
  ├─default_config.yaml
  ├─hccl_tools.py
  ├─getdata.sh
  ├─eval.py
  └─train.py
usage:
train.py [--ascend] [--data DATA_PATH]
         [--dataset NAME] [--optim adam]
options:
--ascend     run training on Ascend devices
--data       path to the dataset directory: DATA_PATH
--dataset    dataset name, e.g. enwik8
--optim      optimizer, default is adam
Parameters for dataset and network (Training/Evaluation):
n_layer number of total layers: N, default is 12
d_model dimension of model, default is 512
n_head number of heads, default is 8
d_head head dimension, default is 64
d_inner inner dimension in FF, default is 2048
dropout global dropout rate: Q, default is 0.1
dropatt attention probability dropout rate: Q, default is 0.0
max_step maximum number of training steps: N, default is 400000
tgt_len number of tokens to predict, default is 512
mem_len length of memory retained from previous segments, default is 512
eval_tgt_len number of tokens to predict for evaluation, default is 128
batch_size batch size of input dataset: N, default is 22
Parameters for learning rate:
lr value of learning rate: Q, default is 0.00025
warmup_step steps of the learning rate warm up: N, default is 0
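For illustration, the sketch below builds a linear-warmup plus cosine-decay schedule over max_step steps with the defaults listed above. The original Transformer-XL recipe uses cosine annealing, but whether the curve stored in lr_of_40w_steps.npy (generated by static_lr.py) matches this exactly is an assumption.

```python
# Hypothetical learning-rate schedule: linear warmup followed by cosine decay.
import math
import numpy as np

def build_lr_schedule(base_lr=0.00025, warmup_step=0, max_step=400000, min_lr=0.0):
    lrs = []
    for step in range(max_step):
        if warmup_step > 0 and step < warmup_step:
            lr = base_lr * (step + 1) / warmup_step            # linear warmup
        else:
            progress = (step - warmup_step) / max(1, max_step - warmup_step)
            lr = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
        lrs.append(lr)
    return np.array(lrs, dtype=np.float32)

# e.g. np.save("lr_of_40w_steps.npy", build_lr_schedule())
```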
Set options in default_config.yaml, including loss_scale, learning rate and network hyperparameters.
Run run_enwik8_base.sh for non-distributed training of the Transformer-XL model.
bash scripts/run_enwik8_base.sh train [DEVICE_ID]
Run run_enwik8_base.sh for distributed training of the Transformer-XL model.
bash scripts/run_enwik8_base.sh train [DEVICE_NUM]
Set options in default_config.yaml. Make sure the 'data' option is set to your own dataset path.
Run eval.py for evaluation of the Transformer-XL model.
bash scripts/run_enwik8_base.sh eval [DEVICE_ID]
Parameters | Ascend |
---|---|
Resource | Ascend 910; OS Euler2.8 |
Uploaded Date | 18/02/2022 (day/month/year) |
MindSpore Version | 1.6.0 |
Dataset | enwik8 |
Training Parameters | batch_size=22 |
Optimizer | Adam |
Loss Function | Softmax Cross Entropy |
BPC Score | 1.11 |
Speed | 30ms/batch |
Loss | 0.78 |
Params (K) | 15.33 |
Checkpoint for inference | 1.45G (.ckpt file) |
Scripts | Transformer-XL scripts |
Parameters | Ascend |
---|---|
Resource | Ascend 910; OS Euler2.8 |
Uploaded Date | 18/02/2022 (day/month/year) |
MindSpore Version | 1.6.0 |
Dataset | enwik8 |
batch_size | 22 |
outputs | loss |
Loss | 0.78 |
BPC Score | 1.11 |
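For reference, the BPC score is the cross-entropy loss expressed in bits, i.e. loss / ln 2 (assuming the reported loss uses natural logarithms). A quick check with the rounded numbers above:

```python
# BPC = cross-entropy loss (natural log) / ln(2)
import math
print(round(0.78 / math.log(2), 2))  # ~1.13, roughly matching the 1.11 BPC above
```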
There are three sources of randomness: dataset shuffling, weight initialization, and dropout.
Seeds have already been set in train.py to remove the randomness of dataset shuffling and weight initialization. If you want to disable dropout, set the corresponding dropout_prob parameter to 0 in default_config.yaml.
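A hypothetical sketch of fixing those seeds (the exact calls and seed values used in train.py may differ):

```python
# Fix NumPy and MindSpore seeds so dataset shuffling and weight initialization
# are reproducible; the seed value 1 is only an example.
import numpy as np
import mindspore as ms

def set_seed(seed: int = 1):
    np.random.seed(seed)  # NumPy-based shuffling (assumption about where shuffling happens)
    ms.set_seed(seed)     # MindSpore weight initialization and random ops
```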
Please check the official homepage.