update code

master
zhangy03 1 month ago
parent
commit
8a05d84806
13 changed files with 22 additions and 142 deletions
  1. README.md (+13, -0)
  2. dataset.py (+1, -14)
  3. generate.py (+0, -15)
  4. pangu_alpha.py (+1, -14)
  5. pangu_alpha_config.py (+1, -14)
  6. pangu_alpha_predict.py (+1, -14)
  7. pangu_alpha_train.py (+1, -14)
  8. pangu_alpha_wrapcell.py (+0, -15)
  9. run_pangu_alpha_predict.py (+1, -14)
  10. run_pangu_alpha_train.py (+1, -14)
  11. serving_demo/PanGu-Alpha-serving-demo.avi (BIN)
  12. tokenization_jieba.py (+1, -0)
  13. utils.py (+1, -14)

README.md (+13, -0)

@@ -3,6 +3,7 @@
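PanGu-α was developed by a joint technical team led by Peng Cheng Laboratory. It is the first model trained in a large-scale distributed fashion on a 2048-card computing cluster, using the automatic hybrid parallelism mode of the domestically developed MindSpore framework on "Peng Cheng Cloud Brain II". The result is the industry's first 200-billion-parameter pretrained generative language model centered on Chinese. The PanGu-α pretrained model supports a wide range of application scenarios, performs strongly in text generation areas such as knowledge question answering, knowledge retrieval, knowledge reasoning, and reading comprehension, and has strong few-shot learning ability.
[[Technical report](https://git.openi.org.cn/PCL-Platform.Intelligence/PanGu-AIpha/src/branch/master/PANGU-%ce%b1.pdf)]
[[Model download](#模型下载)]
[[MindSpore large-scale distributed automatic parallelism framework](https://mindspore.cn/)]
[[Evaluation dataset download](https://git.openi.org.cn/PCL-Platform.Intelligence/Chinese_WPLC)]
[[Serving demo video download](#serving展示视频下载)]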

@@ -17,18 +18,23 @@
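### Dataset

Massive corpora are the foundation of pretrained model research. The joint team collected nearly 80 TB of raw data from open-source datasets, Common Crawl web data, e-books, and other sources.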

<img src="./docs/dataset.png" width="700" height="260"/><br/>

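The team built a distributed cluster for preprocessing large corpora. Through a pipeline of data cleaning and filtering, deduplication, and quality evaluation, it constructed a high-quality Chinese corpus of about 1.1 TB, which by count contains roughly 250B tokens. Each open-source dataset was processed independently, and all label information related to downstream tasks was completely removed to keep the source data unbiased.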
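
The cleaning, deduplication, and quality-filtering stages described above can be sketched in a few lines. This is a minimal single-machine illustration, not the team's distributed pipeline: the function names, the exact-hash deduplication, and the Chinese-character-ratio quality score with its 0.5 threshold are all assumptions made for the example.

```python
import hashlib
import re


def clean(text):
    """Collapse runs of whitespace and strip the ends (illustrative cleaning rule)."""
    return re.sub(r"\s+", " ", text).strip()


def chinese_ratio(text):
    """Fraction of characters in the CJK Unified Ideographs range.

    Hypothetical stand-in for a real quality score.
    """
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)


def preprocess(docs, min_ratio=0.5):
    """Clean, exact-deduplicate, and quality-filter an iterable of documents."""
    seen = set()
    kept = []
    for doc in docs:
        doc = clean(doc)
        # Exact deduplication on the cleaned text; near-duplicate detection is out of scope here.
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if chinese_ratio(doc) >= min_ratio:
            kept.append(doc)
    return kept
```

A production pipeline would typically replace the exact hash with near-duplicate detection and the ratio with a learned or rule-based quality score; the control flow stays the same.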

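### Model architecture

<img src="./docs/model.png" width="850" height="420"/><br/>

The query layer is stacked on top of the transformer layers. Its basic structure is similar to that of a transformer layer, except that an additional Query layer is introduced to predict the position of the next generated query Q.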

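### MindSpore ultra-large-scale automatic parallelism

To train models with hundreds of billions to trillions of parameters efficiently on a large cluster, users must weigh parameter count, computation volume, computation type, cluster bandwidth topology, sample count, and other factors in order to design a well-performing parallel partitioning strategy. Beyond the algorithm itself, the model code also requires a large amount of parallel partitioning and communication code.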

<img src="./docs/Pipline.png" width="950" height="310"/><br/>

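MindSpore is the industry's first framework to support fully automatic parallelism. Its multi-dimensional automatic parallelism combines data parallelism, operator-level model parallelism, pipeline model parallelism, optimizer model parallelism, heterogeneous parallelism, recomputation, efficient memory reuse, and topology-aware scheduling to minimize overall iteration time (computation time + communication time). The programming interface is efficient and easy to use: it decouples algorithm logic from parallel logic, so serial code is distributed and parallelized automatically.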
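
The objective named above, minimizing iteration time as computation time plus communication time, can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the cost formulas, the parameter names, and the restriction to data-parallel × model-parallel splits are invented for illustration and do not describe MindSpore's actual strategy search.

```python
def iteration_time(dp, mp, *, flops, param_bytes, act_bytes, device_flops, bandwidth):
    """Toy estimate of one iteration's time for a (data-parallel, model-parallel) split."""
    # Computation is assumed to divide evenly across all dp * mp devices.
    compute = flops / (dp * mp * device_flops)
    # Assumed ring all-reduce of each device's parameter shard across dp replicas.
    grad_sync = 2 * (dp - 1) / dp * (param_bytes / mp) / bandwidth
    # Assumed activation exchange across mp shards for each replica's slice of the batch.
    act_sync = 2 * (mp - 1) / mp * (act_bytes / dp) / bandwidth
    return compute + grad_sync + act_sync


def best_strategy(num_devices, **workload):
    """Enumerate every (dp, mp) with dp * mp == num_devices and return the fastest."""
    candidates = [
        (dp, num_devices // dp)
        for dp in range(1, num_devices + 1)
        if num_devices % dp == 0
    ]
    return min(candidates, key=lambda s: iteration_time(*s, **workload))
```

Even in this simplified model the best split depends on the workload: parameter-heavy workloads favor model parallelism, activation-heavy ones favor data parallelism, which is the kind of trade-off an automatic search is meant to resolve.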

| Hardware platform | Number of devices | Operating system | Cluster management | Framework |
@@ -61,11 +67,17 @@ MindSpore是业界首个支持全自动并行的框架,MindSpore多维度自
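### Downstream task evaluation

To evaluate model performance, the team collected 16 Chinese downstream tasks of different types, as shown in the figure below: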

<img src="./docs/task.png" width="800" height="350"/><br/>

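Because Chinese lacks a benchmark for few-shot learning, the study compared against CPM (「悟道·文源」, WuDao·WenYuan), the first 2.6-billion-parameter Chinese pretrained language model released by the Beijing Academy of Artificial Intelligence (BAAI). The team strategically sampled a 100 GB dataset of equivalent size from the 1.1 TB corpus, trained a 2.6B-parameter PanGu-α model on it, and compared the two models on the 16 collected downstream tasks. The results are shown in the table below: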

<img src="./docs/2.6B.png" width="800" height="350"/><br/>

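The experimental results show that PanGu-α-2.6B has stronger language learning ability than CPM-2.6B, particularly in few-shot learning and on generation tasks. On generation tasks, PanGu-α-2.6B improves on CPM-2.6B by 6.5 percentage points on average. On PPL tasks, PanGu-α-2.6B is slightly weaker than CPM on OCNLI, TNEWS, and IFLYTEK. This is attributed to the model's larger vocabulary, which makes its perplexity less sensitive to local changes in the text.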
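
For readers unfamiliar with PPL tasks, a common recipe is to fill each candidate label into a prompt template, score every filled prompt by perplexity under the language model, and predict the label whose prompt scores lowest. The sketch below shows only that scoring logic. `token_logprobs` is a hypothetical stand-in for a real model's per-token log-probabilities, and nothing here is taken from the repository's evaluation code.

```python
import math


def perplexity(logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))


def classify_by_ppl(candidates, token_logprobs):
    """Pick the label whose filled-in prompt has the lowest perplexity.

    candidates: dict mapping label -> prompt text with that label filled in.
    token_logprobs: callable mapping text -> list of per-token log-probabilities
                    (hypothetical; a real model would supply this).
    """
    scores = {label: perplexity(token_logprobs(text)) for label, text in candidates.items()}
    return min(scores, key=scores.get), scores
```

Because the score is an average over tokens, a change confined to one or two label tokens moves the perplexity only slightly, which is consistent with the sensitivity issue the paragraph above describes.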

<img src="./docs/13B.png" width="800" height="350"/><br/>

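The team also compared PanGu-α-13B and PanGu-α-2.6B on these downstream tasks. On all generation tasks and most PPL tasks, the 13B model outperforms the 2.6B model. On CMRC2018, DRCD, and WebQA, few-shot scores exceed zero-shot scores by more than 10 points, indicating that PanGu-α-13B has strong few-shot learning ability. On NLI and text classification tasks, PanGu-α-13B performs on par with PanGu-α-2.6B. These tasks are generally hard for generative language models and leave considerable room for improvement, which will be part of the team's future work. The PanGu-α-200B model files are at the TB scale, so downstream inference is time-consuming and labor-intensive, with further room for optimization and acceleration. The team is working together to complete inference and evaluation and will share the results as soon as possible.
Selected generation examples: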

@@ -134,5 +146,6 @@ Generate2:飞云:咳,年轻人说的话要有选择性,我既然说了我
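### Project information

Peng Cheng Laboratory, Peking University, and other partner institutions are the main members of the PanGu-α joint development team.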

<img src="./docs/logos.png" width="266" height="132"/><br/>


dataset.py (+1, -14)

@@ -1,21 +1,8 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
Create dataset for training and evaluating
"""


import os
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C


generate.py (+0, -15)

@@ -1,18 +1,3 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
TopK for text generation
"""

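For orientation, the top-k sampling named in the docstring above can be sketched independently of this repository. The following is not the contents of generate.py: the function name, its parameters, and the pure-Python implementation are assumptions chosen to keep the example self-contained.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token id from the k highest-scoring logits.

    Keeps the k largest logits, applies a softmax over just those,
    and draws one index according to the resulting probabilities.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger `k` trades determinism for diversity.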

pangu_alpha.py (+1, -14)

@@ -1,18 +1,5 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""PANGUALPHA model"""

import math
import numpy as np
import os


pangu_alpha_config.py (+1, -14)

@@ -1,20 +1,7 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting
"""

import mindspore.common.dtype as mstype




pangu_alpha_predict.py (+1, -14)

@@ -1,21 +1,8 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
PANGUALPHA predict script
"""


import os
import numpy as np
import time


pangu_alpha_train.py (+1, -14)

@@ -1,21 +1,8 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
PanGu train script
"""


import os
import math
from pathlib2 import Path


pangu_alpha_wrapcell.py (+0, -15)

@@ -1,20 +1,5 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""PANGUALPHA training wrapper"""


import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore.ops import composite as C


run_pangu_alpha_predict.py (+1, -14)

@@ -1,20 +1,7 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
PanGu predict run
"""

import argparse
from pangu_alpha_config import PANGUALPHAConfig, set_parse
from pangu_alpha_predict import run_predict


run_pangu_alpha_train.py (+1, -14)

@@ -1,20 +1,7 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
PanGu train run
"""

import argparse
from pangu_alpha_config import PANGUALPHAConfig, set_parse
from pangu_alpha_train import run_train


serving_demo/PanGu-Alpha serving demo.avi → serving_demo/PanGu-Alpha-serving-demo.avi (BIN)


tokenization_jieba.py (+1, -0)

@@ -13,6 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for OpenAI GPT."""

from __future__ import (absolute_import, division, print_function,
                        unicode_literals)



utils.py (+1, -14)

@@ -1,20 +1,7 @@
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting, gradient clip function and dynamic learning rate function
"""

import numpy as np
from multiprocessing import Process
import mindspore.nn as nn
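
The utils.py docstring mentions a gradient clip function and a dynamic learning rate function, but the diff does not show their bodies. As a generic illustration of what such helpers usually do, here is a minimal pure-Python sketch. The choice of global-norm clipping, the linear-warmup-then-cosine-decay schedule, and all names and parameters are assumptions, not the repository's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of gradient vectors (lists of floats).
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return [list(vec) for vec in grads]
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, end_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to end_lr (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * progress))
```

Clipping by the global norm preserves the direction of the overall update, which is why it is often preferred over clipping each value independently.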

