This project releases open-source natural language models from the Joint Lab of BAAI and JDAI.
Unlike other open-source Chinese NLP models, we mainly focus on basic models for dialogue systems, especially in the e-commerce domain.
Our corpus is very large: we currently use 42 GB of Customer Service Dialogue Data (CSDD) for training, containing about 1.2 billion sentences.
We provide the pre-trained BERT model and word embeddings. The table below shows the data we use.
Task | Data Source | Sentences | Tokens | Vocabulary Size |
---|---|---|---|---|
Pre-Training | CSDD (Customer Service Dialogue Data) | 1.2B | 9B | 1M |
The download links for the models are listed below.
Model | Data Source | Link |
---|---|---|
BAAI-JDAI-BERT, Chinese | CSDD | JD-BERT for TensorFlow |
BAAI-JDAI-WordEmbedding | CSDD | JD-WORD-EMBEDDING with 300d |
The JD-BERT.tar.gz file contains the following items:
```
|—— BAAI-JDAI-BERT
    |—— bert_model.ckpt.*           # pre-trained weights
    |—— bert_config.json            # hyperparameters of the model
    |—— vocab.txt                   # vocabulary for WordPiece
    |—— JDAI-BERT.md & INTRO.md     # summary and details
```
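After unpacking, you can sanity-check the released checkpoint before fine-tuning. A minimal sketch, assuming the archive was extracted into `./BAAI-JDAI-BERT` (the path is illustrative):

```python
# Minimal sketch: list the pre-trained variables in the released checkpoint.
# Assumes TensorFlow is installed and the archive was unpacked into ./BAAI-JDAI-BERT.
import tensorflow as tf

ckpt = "BAAI-JDAI-BERT/bert_model.ckpt"  # prefix of the bert_model.ckpt.* files
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```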
The JD-WORD-EMBEDDING.tar.gz file contains the following items:
```
|—— BAAI-JDAI-WORD-EMBEDDING
    |—— JDAI-Word-Embedding.txt               # word vectors; each line holds a word and its values, separated by whitespace
    |—— JDAI-WORD-EMBEDDING.md & INTRO.md     # summary and details
```
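A minimal sketch of loading the released vectors into memory, assuming each line holds a word followed by its 300 float values (the exact file layout, e.g. a possible header line, should be checked against INTRO.md):

```python
import numpy as np

def load_vectors(path, dim=300):
    """Load whitespace-separated word vectors into a dict of numpy arrays."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip a possible header or malformed line
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

emb = load_vectors("BAAI-JDAI-WORD-EMBEDDING/JDAI-Word-Embedding.txt")
```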
Masking | Dataset | Sentences | Training Steps | Device | Init Checkpoint | Init LR |
---|---|---|---|---|---|---|
WordPiece | CSDD | 1.2B | 1M | P40×4 | Google's BERT weights | 1e-4 |
Our BERT model is <12-layer, 768-hidden, 12-heads, 110M parameters>, Chinese, and its bert_config.json and vocab.txt are identical to Google's original settings. We do not apply Chinese Whole Word Masking (WWM) in our current pre-training; instead we use Google's original masking, which works at the Chinese character level. We use the Train data of LCQMC (Large-scale Chinese Question Matching Corpus) and CSDQMC (Customer Service Dialogue Question Matching Corpus) for fine-tuning, training for just 2 epochs with an initial learning rate of 2e-5 on each dataset.
Dataset | Train | Test | Domain | MaxLen | Batch Size | Epoch |
---|---|---|---|---|---|---|
LCQMC | 140K | 12.5K | Zhidao | 128 | 32 | 2 |
CSDQMC | 200K | 9K | Customer Service | 128 | 32 | 2 |
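For illustration, here is a minimal sketch of how a question pair could be turned into BERT input features using the released vocab.txt, assuming tokenization.py from the google-research/bert repository is on the Python path (the example questions are invented):

```python
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="BAAI-JDAI-BERT/vocab.txt", do_lower_case=True)

q1, q2 = "怎么申请退货", "如何退货"  # invented question pair
tokens_a, tokens_b = tokenizer.tokenize(q1), tokenizer.tokenize(q2)

# [CLS] q1 [SEP] q2 [SEP], with segment ids 0 for q1 and 1 for q2
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```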
We evaluate our pre-trained model on the FAQ task with the Test data of LCQMC and CSDQMC.
Model | LCQMC | CSDQMC |
---|---|---|
ERNIE | 87.2 | - |
BERT | 86.9 | 85.1 |
BERT-wwm | 88.7 | 86.6 |
BAAI-JDAI-BERT | 88.6 | 87.5 |
We quote the BERT and ERNIE results on LCQMC from the Chinese-BERT-wwm report.
Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word Threshold | Iter | Negative Sampling for SGNS | Dim |
---|---|---|---|---|---|---|
5 | Yes | 1e-5 | 10 | 10 | 5 | 300 |
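The README does not name the training toolkit; as one way to reproduce this configuration, here is a hedged sketch using gensim's Word2Vec (gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

# Mirror the hyperparameters in the table above; `corpus` would be an
# iterable of tokenized CSDD sentences, which is not included here.
model = Word2Vec(
    vector_size=300,  # Dim
    window=5,         # Window Size (gensim shrinks windows dynamically by default)
    sample=1e-5,      # Sub-sampling
    min_count=10,     # Low-frequency word threshold
    epochs=10,        # Iter
    negative=5,       # Negative sampling
    sg=1,             # skip-gram with negative sampling (SGNS)
)
# model.build_vocab(corpus)
# model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```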
We show the top-3 similar words for some sample words below, using cosine similarity to measure how close two words are.
Input Word | 口红 | 西屋 | 花花公子 | 蓝月亮 | 联想 | 骆驼 |
---|---|---|---|---|---|---|
Similar 1 | 唇釉 | 典泛 | PLAYBOY | 威露士 | 宏碁 | CAMEL |
Similar 2 | 唇膏 | 法格 | 富贵鸟 | 增艳型 | 15IKB | 骆驼牌 |
Similar 3 | 纪梵希 | HS1250 | 霸王车 | 奥妙 | 14IKB | 健足乐 |
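A minimal sketch of the nearest-neighbor lookup behind this table, using cosine similarity (the toy vectors are invented; real use would load JDAI-Word-Embedding.txt as above):

```python
import numpy as np

def top_k_similar(query, vectors, k=3):
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scored = [(float(np.dot(q, v / np.linalg.norm(v))), w)
              for w, v in vectors.items() if w != query]
    return sorted(scored, reverse=True)[:k]

# Invented 2-d toy vectors; the released vectors are 300-d.
toy = {"口红": np.array([1.0, 0.1]), "唇釉": np.array([0.9, 0.2]),
       "唇膏": np.array([0.8, 0.3]), "骆驼": np.array([0.0, 1.0])}
print(top_k_similar("口红", toy))  # -> 唇釉 and 唇膏 rank first
```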
In summary, this project open-sources several pre-trained natural language processing models, focusing on basic models for dialogue systems, especially in the e-commerce domain. The models are trained on 42 GB of customer service dialogue data (about 1.2 billion sentences), and both the trained BERT model and the word embedding model are released.