Paper: Zhang S, Huang H, Liu J, et al. Spelling Error Correction with Soft-Masked BERT[J]. arXiv preprint arXiv:2005.07421, 2020.
Soft-Masked BERT consists of a detection network based on a Bi-GRU and a correction network based on BERT. The detection network predicts the probability that each character is an error, and the correction network predicts the probability of the correct character; the detection network passes its predictions to the correction network through soft masking.
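To make the soft-masking step concrete, here is a minimal NumPy sketch of the embedding interpolation described in the paper. The names (`token_emb`, `mask_emb`, `error_prob`) are illustrative, not the repository's identifiers:

```python
import numpy as np

# Illustrative shapes: batch of 2 sentences, 4 tokens, hidden size 8.
token_emb = np.random.randn(2, 4, 8).astype(np.float32)  # input token embeddings
mask_emb = np.random.randn(8).astype(np.float32)         # embedding of the [MASK] token

# The detection network (a Bi-GRU here) outputs, per token, the
# probability that the token is a spelling error.
error_prob = np.random.rand(2, 4, 1).astype(np.float32)

# Soft masking: e'_i = p_i * e_mask + (1 - p_i) * e_i.
# Tokens that are likely errors are pushed toward the [MASK] embedding.
soft_masked_emb = error_prob * mask_emb + (1.0 - error_prob) * token_emb

# soft_masked_emb is then fed to the BERT-based correction network.
print(soft_masked_emb.shape)  # (2, 4, 8)
```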
train.sgml
B1_training.sgml
C1_training.sgml
SIGHAN15_CSC_A2_Training.sgml
SIGHAN15_CSC_B2_Training.sgml
SIGHAN15_CSC_TestInput.txt
SIGHAN15_CSC_TestTruth.txt
bash scripts/run_preprocess.sh
pip install -r requirement.txt
# Distributed training
bash scripts/run_distribute_train.sh [RANK_SIZE] [RANK_START_ID] [RANK_TABLE_FILE] [BERT_CKPT]
BERT_CKPT: pre-trained BERT checkpoint file name (for example, bert_base.ckpt)
# Standalone training
bash scripts/run_standalone_train.sh [BERT_CKPT] [DEVICE_ID] [PYNATIVE]
BERT_CKPT: pre-trained BERT checkpoint file name (for example, bert_base.ckpt)
DEVICE_ID: ID of the device to run on
PYNATIVE: whether to run in PyNative mode (default: False)
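For reference, the [PYNATIVE] flag typically toggles MindSpore's execution mode. A minimal sketch, assuming the flag is mapped to a context setting as in most MindSpore model_zoo scripts (the exact wiring in train.py may differ):

```python
from mindspore import context

# PyNative mode executes operators eagerly and is easier to debug;
# graph mode (the default here) compiles the whole network for speed.
pynative = False  # corresponds to the [PYNATIVE] script argument

context.set_context(
    mode=context.PYNATIVE_MODE if pynative else context.GRAPH_MODE,
    device_target="Ascend",
    device_id=0,  # corresponds to the [DEVICE_ID] script argument
)
```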
# (1) Go to the [code repository](https://git.openi.org.cn/OpenModelZoo/SoftMaskedBert) and create a training task.
# (2) Set "enable_modelarts=True; bert_ckpt=bert_base.ckpt".
# (3) If running in PyNative mode, set "pynative=True".
# (4) Select the dataset "softmask.zip" on the web page.
# (5) Set the startup file to "train.py".
# (6) Run the training task.
After training completes, run the evaluation as follows:
# Evaluation
bash scripts/run_eval.sh [BERT_CKPT_NAME] [CKPT_DIR]
├── model_zoo
├── README.md // All model related instructions
├── soft-masked-bert
├── README.md // Soft-Masked BERT related instructions
├── README_CN.md // Soft-Masked BERT related instructions in Chinese
├── ascend310_infer // Ascend 310 inference source code
├── scripts
│ ├──run_distribute_train.sh // Shell script for distributed training on Ascend
│ ├──run_standalone_train.sh // Shell script for single-device training on Ascend
│ ├──run_eval.sh // Shell script for evaluation on Ascend
│ ├──run_infer_310.sh // Shell script for Ascend 310 inference
│ ├──run_preprocess.sh // Shell script for data preprocessing
├── src
│ ├──soft_masked_bert.py // Soft-Masked BERT architecture
│ ├──bert_model.py // BERT architecture
│ ├──dataset.py // Dataset processing
│ ├──finetune_config.py // Model hyperparameters
│ ├──gru.py // GRU architecture
│ ├──tokenization.py // Tokenizer
│ ├──util.py // Utility functions
├── train.py // Training script
├── eval.py // Evaluation script
├── preprocess.py // Data preprocessing for Ascend 310 inference
├── postprocess.py // Postprocessing for Ascend 310 inference
├── export.py // Checkpoint file export
├── preprocess_dataset.py // Dataset preprocessing
'batch_size': 36                 # Batch size
'epoch': 100                     # Total number of training epochs
'learning_rate': 0.0001          # Initial learning rate
'loss_function': 'BCELoss'       # Loss function used for training
'optimizer': 'AdamWeightDecay'   # Optimizer used for training
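As a hedged sketch of how these choices map to MindSpore primitives (the real wiring lives in train.py and src/finetune_config.py; the stand-in network below is hypothetical):

```python
import mindspore.nn as nn

# Hypothetical stand-in network; train.py builds the actual Soft-Masked BERT.
net = nn.Dense(8, 1)

# Loss function and optimizer named in the hyperparameters above.
loss_fn = nn.BCELoss(reduction='mean')
optimizer = nn.AdamWeightDecay(net.trainable_params(), learning_rate=0.0001)
```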
bash scripts/run_standalone_train.sh [BERT_CKPT] [DEVICE_ID] [PYNATIVE]
After training, you can find the checkpoint file in the default scripts folder. The loss values are printed as follows:
Epoch: 1 Step: 152, loss is 3.3235654830932617
Epoch: 1 Step: 153, loss is 3.6958463191986084
Epoch: 1 Step: 154, loss is 3.585498571395874
Epoch: 1 Step: 155, loss is 3.276094913482666
bash scripts/run_distribute_train.sh [RANK_SIZE] [RANK_START_ID] [RANK_TABLE_FILE] [BERT_CKPT]
The shell script above runs the distributed training in the background.
Epoch: 1 Step: 12, loss is 7.957302093505859
Epoch: 1 Step: 13, loss is 7.886098861694336
Epoch: 1 Step: 14, loss is 7.781495094299316
Epoch: 1 Step: 15, loss is 7.755488395690918
Before running inference, export the model to a MINDIR file with export.py. Input files must be in BIN format.
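A minimal sketch of the export step, assuming a stand-in network and hypothetical file names (see export.py for the actual logic):

```python
import numpy as np
import mindspore as ms
from mindspore import nn, Tensor

# Stand-in network; export.py builds the real Soft-Masked BERT and loads
# the trained parameters, e.g. with ms.load_checkpoint / ms.load_param_into_net.
class TinyNet(nn.Cell):
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(8, 2)

    def construct(self, x):
        return self.dense(x)

net = TinyNet()

# Export to MINDIR with a dummy input of the expected shape.
dummy = Tensor(np.ones((1, 8), dtype=np.float32))
ms.export(net, dummy, file_name="soft_masked_bert", file_format="MINDIR")
```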
# Ascend310 inference
bash scripts/run_infer_310.sh [MINDIR_PATH] [DATA_FILE_PATH] [NEED_PREPROCESS] [DEVICE_ID]
NEED_PREPROCESS: whether preprocessing is needed; its value is 'y' or 'n'.
DEVICE_ID: optional; the default value is 0.
The inference results are saved in the project's main path; you can find them in the acc.log file.
1 The detection result is precision=0.6733436055469953, recall=0.6181046676096181 and F1=0.6445427728613569
2 The correction result is precision=0.8260869565217391, recall=0.7234468937875751 and F1=0.7713675213675213
3 Sentence Level: acc:0.606364, precision:0.650970, recall:0.433579, f1:0.520487
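For reference, the F1 values above are the usual harmonic mean of precision and recall:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Detection-level result reported above:
print(f1(0.6733436055469953, 0.6181046676096181))  # ~0.6445
# Correction-level result:
print(f1(0.8260869565217391, 0.7234468937875751))  # ~0.7714
```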
| Parameters | Ascend |
| --- | --- |
| Model Version | BERT-base |
| Resource | Ascend 910; CPU 2.60 GHz, 192 cores; Memory 755 GB; OS Euler2.8 |
| Uploaded Date | 2022-06-28 |
| MindSpore Version | 1.6.0 |
| Dataset | SIGHAN |
| Training Parameters | epoch=100, steps=6994, batch_size=36, lr=0.0001 |
| Optimizer | AdamWeightDecay |
| Loss Function | BCELoss |
| Loss | 0.0016 |
| Speed | 1p: 300 ms/step; 8p: 306 ms/step |
| Total Time | 1p: 3497 mins; 8p: 446 mins |
| Checkpoint for Fine-tuning | 459M (.ckpt file) |
| Scripts | link |
Provide the details of evaluation performance, including latency, accuracy, and so on. For example, you can reference the following template:
| Parameters | Ascend |
| --- | --- |
| Model Version | ResNet18 |
| Resource | Ascend 910; OS Euler2.8 |
| Uploaded Date | 02/25/2021 (month/day/year) |
| MindSpore Version | 1.7.0 |
| Dataset | CIFAR-10 |
| batch_size | 32 |
| outputs | probability |
| Accuracy | 94.02% |
| Model for inference | 43M (.air file) |
If you want to contribute, please review the contribution guidelines and how_to_contribute
Please check the official homepage.