Word Embedding with PaddleNLP

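Introduction

PaddleNLP ships with a number of publicly available pretrained embeddings. They can be loaded through the paddlenlp.embeddings.TokenEmbedding interface to improve training results. This example trains a text classifier on ChnSentiCorp, an open-source sentiment classification dataset, to show how paddlenlp.embeddings.TokenEmbedding affects training. For more on using paddlenlp.embeddings.TokenEmbedding, see the TokenEmbedding interface usage guide.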
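The idea behind loading a pretrained embedding can be sketched in plain Python. This is a toy illustration, not PaddleNLP's implementation: the function names and the 3-dimensional vectors are made up for the example. Words that have a pretrained vector start from it, while the rest start from small random values, as a freshly initialized embedding layer would.

```python
import math
import random


def build_embedding_table(vocab, pretrained, dim, seed=0):
    """Return one vector per vocabulary word.

    Words found in `pretrained` reuse their pretrained vector; the rest
    get small random values, like a randomly initialized embedding layer.
    """
    rng = random.Random(seed)
    table = []
    for word in vocab:
        if word in pretrained:
            table.append(list(pretrained[word]))
        else:
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "pretrained" vectors, invented for illustration.
pretrained = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
}
vocab = ["good", "great", "bad", "hotel"]  # "hotel" has no pretrained vector
table = build_embedding_table(vocab, pretrained, dim=3)

print(cosine_similarity(table[0], table[1]))  # similar words: close to 1
print(cosine_similarity(table[0], table[2]))  # opposite words: negative
```

Because related words already sit close together before training starts, the classifier has less to learn from the labeled data alone.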

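Quick Start

Dependencies

visualdl

Install command: pip install visualdl

Start Training

We use ChnSentiCorp, a public Chinese sentiment classification dataset, as the example dataset. The commands below train the model on the training set (train.tsv) and validate it on the validation set (dev.tsv). During training, the vocabulary file dict.txt is downloaded automatically; it is used to tokenize the dataset and build the data samples.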
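How a vocabulary turns tokenized text into model inputs can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes one token per line in the vocabulary file and uses hypothetical [UNK] and [PAD] special tokens; the helper names are illustrative and are not taken from data.py.

```python
def load_vocab(lines, unk_token="[UNK]", pad_token="[PAD]"):
    """Map each token to an integer id, adding the special tokens if missing."""
    vocab = {}
    for line in lines:
        token = line.rstrip("\n")
        if token and token not in vocab:
            vocab[token] = len(vocab)
    for special in (unk_token, pad_token):
        vocab.setdefault(special, len(vocab))
    return vocab


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token; out-of-vocabulary tokens map to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


def pad_batch(batch, pad_id):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]


# A tiny made-up vocabulary standing in for the contents of dict.txt.
vocab = load_vocab(["酒店", "服务", "很", "好"])

# Two already-tokenized samples; "差" is not in the vocabulary.
samples = [["酒店", "很", "好"], ["服务", "差"]]
batch = pad_batch([tokens_to_ids(s, vocab) for s in samples], vocab["[PAD]"])
print(batch)
```

In practice `lines` would come from opening the vocabulary file, and the padded id lists would be converted to tensors before being fed to the embedding layer.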

Launch training:

# Use paddlenlp.embeddings.TokenEmbedding
python train.py --device='gpu' \
                --lr=5e-4 \
                --batch_size=64 \
                --epochs=20 \
                --use_token_embedding=True \
                --vdl_dir='./vdl_dir'

# Use paddle.nn.Embedding
python train.py --device='gpu' \
                --lr=1e-4 \
                --batch_size=64 \
                --epochs=20 \
                --use_token_embedding=False \
                --vdl_dir='./vdl_dir'

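The arguments above mean:

device: Training device. Currently one of 'gpu', 'cpu', or 'xpu'. Defaults to gpu.
lr: Learning rate. Defaults to 5e-4.
batch_size: Number of samples per batch. Defaults to 64.
epochs: Number of training epochs. Defaults to 5.
use_token_embedding: Whether to use paddlenlp.embeddings.TokenEmbedding. Defaults to True.
vdl_dir: VisualDL log directory. VisualDL data produced during training is saved here. Defaults to ./vdl_dir.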

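The script also accepts the following arguments:

save_dir: Directory where models are saved. Defaults to "./checkpoints/".
init_from_ckpt: Checkpoint path to resume training from. Defaults to None, meaning training is not resumed.
embedding_name: Name of the pretrained embedding. Defaults to w2v.baidu_encyclopedia.target.word-word.dim300. For the supported pretrained embeddings, see the Embedding model summary.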
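The documented arguments and defaults can be expressed as an argparse parser. This is a sketch of how such options might be declared, not the actual contents of train.py; `str2bool` and `build_parser` are hypothetical helpers. The boolean helper matters because a command such as `--use_token_embedding=False` passes the string "False", which plain `bool()` would treat as true.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style strings into a real boolean."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Declare the options listed above with their documented defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented training options")
    parser.add_argument("--device", choices=["gpu", "cpu", "xpu"], default="gpu")
    parser.add_argument("--lr", type=float, default=5e-4)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--use_token_embedding", type=str2bool, default=True)
    parser.add_argument("--vdl_dir", default="./vdl_dir")
    parser.add_argument("--save_dir", default="./checkpoints/")
    parser.add_argument("--init_from_ckpt", default=None)
    parser.add_argument(
        "--embedding_name",
        default="w2v.baidu_encyclopedia.target.word-word.dim300",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--lr=1e-4", "--epochs=20", "--use_token_embedding=False"]
)
print(args.lr, args.epochs, args.use_token_embedding)
```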

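Note:

The program runs training, evaluation, and testing automatically. During training, the model is saved to the specified save_dir: parameters are written after every epoch and named after the current epoch number. For example, at epoch 2 the model parameters are saved as ./checkpoints/2.pdparams and the optimizer state as ./checkpoints/2.pdopt.

For example: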

checkpoints/
├── 0.pdopt
├── 0.pdparams
├── 1.pdopt
├── 1.pdparams
├── ...
└── final.pdparams

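To resume training, set init_from_ckpt to the checkpoint file name without its extension. For example, to warm-start from the model saved at epoch 10, pass --init_from_ckpt=./checkpoints/10. The program then loads the model parameters from ./checkpoints/10.pdparams and the optimizer state from ./checkpoints/10.pdopt automatically.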
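The naming convention above can be sketched as a small path helper. The function names are hypothetical and this is not the script's actual code; it only shows how an extension-free prefix maps to the two files. A Paddle training script would presumably then read those files with paddle.load and apply them with set_state_dict, which this sketch leaves out.

```python
import os


def epoch_prefix(save_dir, epoch):
    """Checkpoint prefix for a given epoch, for example ./checkpoints/2."""
    return os.path.join(save_dir, str(epoch))


def checkpoint_paths(prefix):
    """Expand an extension-free prefix into (parameters file, optimizer file)."""
    return prefix + ".pdparams", prefix + ".pdopt"


# Resuming from epoch 10 corresponds to --init_from_ckpt=./checkpoints/10
params_path, opt_path = checkpoint_paths("./checkpoints/10")
print(params_path)
print(opt_path)
```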

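Launch VisualDL

VisualDL is recommended for comparing experiments. The launch command is shown below; the directory passed to logdir must be the same as the vdl_dir given when training was started. (For more on VisualDL, see the VisualDL usage guide.)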

visualdl --logdir ./vdl_dir --port 8888 --host 0.0.0.0

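Training Results Comparison

Open ip:8888 in Chrome, where ip is the IP address of the machine running VisualDL.

In the example comparison plot, the blue curve is the experiment with paddlenlp.embeddings.TokenEmbedding and the green curve is the experiment with an Embedding that loads no pretrained model.
With paddlenlp.embeddings.TokenEmbedding, validation accuracy trends upward and converges at about 0.90. After converging it stays relatively stable and is less prone to overfitting.
Without paddlenlp.embeddings.TokenEmbedding, validation accuracy trends downward and converges at about 0.86. The example experiment shows that paddlenlp.embeddings.TokenEmbedding can improve training results.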

Eval Acc:

	Best Acc
paddle.nn.Embedding	0.8965
paddlenlp.embeddings.TokenEmbedding	0.9082

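Acknowledgements

Thanks to Chinese-Word-Vectors for the pretrained Chinese Word2Vec embeddings, the GloVe Project for the pretrained English GloVe embeddings, and the FastText Project for the pretrained English fastText embeddings.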

References

Li, Shen, et al. "Analogical reasoning on chinese morphological and semantic relations." arXiv preprint arXiv:1805.06504 (2018).
Qiu, Yuanyuan, et al. "Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." 2014.
Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. "Advances in Pre-Training Distributed Word Representations."
